Welcome to the third annual NASA Symposium on VLSI Design, co-sponsored by the
IEEE.  Each  year  this  symposium  is  organized  by  the  NASA  Space  Engineering  Research 
Center  (SERC)  at  the  University  of  Idaho  and  is  held  in  conjunction  with  a quarterly 
meeting  of  the  NASA  Data  System  Technology  Working  Group  (DSTWG).  One  task  of 
the  DSTWG  is  to  develop  new  electronic  technologies  that  will  meet  next  generation 
electronic  data  system  needs.  The  symposium  provides  insights  into  developments  in  VLSI 
and  digital  systems  which  can  be  used  to  increase  data  systems  performance. 

The NASA SERC is proud to offer, at its third symposium on VLSI design, presentations
by an outstanding set of individuals from national laboratories, the electronics
industry  and  universities.  These  speakers  share  insights  into  next  generation  advances 
that  will  serve  as  a basis  for  future  VLSI  design. 

Interest in the conference has increased, with 46 papers in 8 categories included in
this year’s proceedings. National laboratories are represented by Lawrence Livermore
Laboratory and the Johns Hopkins University Applied Physics Laboratory. Private industry
is represented by Hewlett Packard-CTG, Hewlett Packard-ICBD, Advanced Hardware
Architectures, Smith International Inc., and United Technologies Microelectronics Center.
Universities are represented by Brigham Young University, Montana State University,
Washington State University, University of Calgary, University of Western Australia,
University of Houston, Stanford University, Ecole Polytechnique de Montreal, Concordia
University, University of California at Davis, University of British Columbia, Portland State
University, University of Madras, Old Dominion University and the University of Idaho. In
addition, we are happy to welcome a number of papers presented by international authors.

There  are  individuals  whose  assistance  was  critical  to  the  success  of  this  symposium. 
Barbara  Martin  worked  long  hours  to  assemble  the  conference  proceedings.  Judy  Wood  did 
another  excellent  job  at  coordinating  the  many  conference  activities.  Sterling  Whitaker 
organized  the  symposium.  The  efforts  of  these  professionals  were  vital  and  are  greatly 
appreciated. 

I am encouraged by the growth we have experienced in this year’s symposium and look
for suggestions that will allow a better symposium next year. I hope you enjoy your stay
in Moscow, Idaho, and I extend an invitation to visit MRC research laboratories during
the symposium.


Gary  K.  Maki 
Experience with Custom Processors
in  Space  Flight  Applications 

M. E. Fraeman, J. R. Hayes, D. A. Lohr, B. W. Ballard,
R. L. Williams, and R. M. Henshaw
Johns  Hopkins  University  Applied  Physics  Laboratory 
Laurel,  Maryland  20723 

Abstract- APL has developed a magnetometer instrument for a Swedish satellite
named Freja with launch scheduled for August 1992 on a Chinese Long
March rocket. The magnetometer controller utilized a custom microprocessor
designed at APL with the Genesil silicon compiler. The processor evolved from
our experience with an older bit-slice design and two prior single chip efforts.
The architecture of our microprocessor greatly lowered software development
costs because it was optimized to provide an interactive and extensible
programming environment hosted by the target hardware. Radiation tolerance
of the microprocessor was also tested and was adequate for Freja’s mission:
20 kRad(Si) total dose and very infrequent latch-up and single event upset
events.


1 Introduction 

The Johns Hopkins University Applied Physics Laboratory (APL) has developed a
microprocessor that is well suited to one-of-a-kind embedded applications, especially in satellite
instrument control. The chip has been qualified for use in a magnetometer instrument for
the Swedish Freja satellite. The processor’s language-directed architecture reduced Freja
software costs because the flight hardware served as its own development system. Thus,
unlike traditional interpreted programming languages like Basic, Lisp, or Smalltalk, our
Forth language development system was fully supported on the embedded flight processor.
Performance was also equal to or better than that obtained by other microprocessors
programmed in languages like C with traditional cross-compilers and development systems.

Our  experiences  using  Forth  to  program  spacecraft  instrumentation  computers,  and  our 
early  efforts  to  design  a 32-bit  microprocessor  specifically  intended  to  execute  Forth  code 
are  described  in  this  paper.  The  design,  architecture,  and  performance  of  our  most  recent 
version of this microprocessor, called the SC32¹, are summarized in Section 4. Discussion
of  our  use  of  the  SC32  in  the  Freja  magnetometer  includes  our  efforts  to  qualify  the 
microprocessor  for  space  flight.  Finally,  we  discuss  some  of  the  lessons  we  learned  using  a 
custom  designed  integrated  circuit  in  space  flight  hardware. 

¹ The SC32 has been commercially licensed by Silicon Composers, Inc., Palo Alto, CA. They offer chips,
board level development systems, and support software.
Table 1: APL Forth-based Subsystems and Experiments

Spacecraft   Subsystem/Experiment          Launch Date        Processor
MAGSAT       Attitude Control              6/79               RCA 1802
DMSP         Magnetometer                  (classified)       RCA 1802
HILAT        Magnetometer                  6/83               RCA 1802
Polar Bear   Magnetometer                  6/83               RCA 1802
Astro-1      Ultraviolet Telescope (HUT)   12/90              AMD 2900
Freja        Magnetometer                  8/92 (scheduled)   SC32

2 Background 

2.1  Forth 

Forth has an extremely simple syntax, so only a trivial parser is needed, which allows it to
run in impoverished hardware environments. Lexical properties are also simple. Forth
subroutines, called words, are delimited by spaces. The words themselves can consist
of any characters other than the delimiter. This simplicity keeps the interpreter small,
allowing full featured Forth systems to fit comfortably in as little as 8 kbytes of memory!

Programming  in  Forth  consists  of  defining  new  words  in  terms  of  existing  words.  The 
new  word  is  incrementally  compiled  and  can  be  invoked  interactively  by  the  programmer. 
Thus,  the  usual  benefits  of  interpreted  languages  are  reaped,  especially  simplified  testing 
and  a resulting  higher  confidence  in  program  correctness. 
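
The structure described above can be illustrated with a minimal sketch in Python. It is not the APL Forth system: the `MiniForth` class, its small word set, and the simplified handling of `:` and `;` are assumptions made only to show whitespace-delimited parsing and the definition of new words in terms of existing ones. Unlike a real Forth, this sketch stores definitions as lists of word names rather than compiling them.

```python
class MiniForth:
    """Toy Forth-like interpreter: whitespace-delimited words, one data stack."""

    def __init__(self):
        self.stack = []
        # Primitive words map names to Python callables (an assumption of this sketch).
        self.primitives = {
            "+": lambda: self._binary(lambda a, b: a + b),
            "*": lambda: self._binary(lambda a, b: a * b),
            "dup": lambda: self.stack.append(self.stack[-1]),
            "drop": lambda: self.stack.pop(),
            "swap": self._swap,
        }
        # User-defined words map names to lists of existing words.
        self.definitions = {}

    def _binary(self, op):
        b = self.stack.pop()
        a = self.stack.pop()
        self.stack.append(op(a, b))

    def _swap(self):
        self.stack[-1], self.stack[-2] = self.stack[-2], self.stack[-1]

    def execute(self, word):
        if word in self.definitions:
            for inner in self.definitions[word]:
                self.execute(inner)
        elif word in self.primitives:
            self.primitives[word]()
        else:
            self.stack.append(int(word))  # anything else must be an integer literal

    def interpret(self, line):
        tokens = line.split()  # the entire "parser": split on whitespace
        i = 0
        while i < len(tokens):
            if tokens[i] == ":":
                # ": name body... ;" defines a new word from existing words.
                end = tokens.index(";", i)
                self.definitions[tokens[i + 1]] = tokens[i + 2:end]
                i = end + 1
            else:
                self.execute(tokens[i])
                i += 1
        return self.stack
```

For example, after `forth = MiniForth()` and `forth.interpret(": square dup * ;")`, the call `forth.interpret("3 square 4 square +")` leaves `[25]` on the stack. The new word is usable as soon as it is defined, which is the interactive, incremental style the text describes.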

2.2 APL Space Applications of Forth

Table 1 summarizes APL’s experience with spacecraft instrumentation we have developed
and programmed using Forth.[1] We have also used the language on other projects including
ground support equipment and control of laboratory instrumentation. Application tasks
ranged from relatively simple data acquisition functions to control of the complex space
shuttle based Hopkins Ultraviolet Telescope (HUT), one of three ultraviolet telescopes
(all programmed in Forth) that comprised the Astro-1 mission at the end of 1990. Our
most recent instrument, a magnetometer for the Swedish Freja satellite, will be described
later in this paper.

Our  earliest  space  flight  applications  were  based  on  the  relatively  simple  RCA  1802 
microprocessor.  But  during  the  early  definition  of  the  HUT  command  and  data  handling 
system  around  1980,  it  became  clear  that  a far  more  powerful  processor  was  needed  to 
satisfy  that  project’s  requirements.  After  exploring  an  architecture  based  on  as  many 
as  four  TI  9900  microprocessors  (the  fastest  microprocessor  qualified  for  space  that  was 
then  available),  we  realized  that  a single  faster  machine  would  have  numerous  advantages. 
The  software  would  be  easier  to  write  and  test,  and  more  importantly,  uniprocessor  code 
and  hardware  would  be  more  flexible  in  the  face  of  evolving  requirements  and  as  system 
interfaces were more clearly defined.

2.3  The  Hopkins  Ultraviolet  Telescope  Processor 

The AMD 2900 bit-slice component family was used to build a 16-bit computer that
implemented Forth’s primitive operations directly in microcode. In the early 1980s, this was
the only way we could build a single processor with throughput that met our requirements
and that also could be qualified for use in space. Our bit-slice processor was able to
compile and execute Forth interactively, even on the flight unit, without needing extensive
support  tools.  Performance  was  also  very  good  (approximately  500,000  Forth  operations 
per  second)  which  allowed  us  to  design  an  unusually  flexible  software  system.  The  final 
flight  software  required  about  5 person-years  of  development  time  (including  developing 
the  detailed  software  requirements),  contained  29  cooperating  concurrent  processes,  and 
consisted  of  about  12,000  lines  of  Forth  code  and  comments. 

We  gained  valuable  experience  with  Forth  based  computers  while  developing,  using, 
and  flying  the  HUT  processor.  A fast  computer  that  supported  a compact  but  interactive 
and  extensible  software  development  system  on  flight  hardware  had  many  advantages.  It 
encouraged  the  development  of  powerful  yet  flexible  software  while  minimizing  the  costs  of 
writing, testing, and maintaining that code. However, HUT also showed that the 64 Kword
address space of 16-bit machines was inadequate for larger embedded systems. Towards
the end of the development cycle, flight processor memory became too full to support an
interactive environment, so we had to fall back on a clumsier, traditional cross-compiler
based methodology.

3 The  FRISC  Project 

At  the  same  time  our  work  on  HUT  hardware  was  winding  down  in  1984,  we  were  also 
initiating  an  effort  to  develop  experience  in  VLSI  design.  We  combined  our  experience 
in  Forth  computers  and  our  interest  in  VLSI  into  an  effort  to  develop  a 32-bit  Forth 
microprocessor. During 1985 we developed the processor architecture that we called FRISC
(Forth Reduced Instruction Set Computer) and ported VLSI design tools developed at
several universities to a 68010-based workstation.

3.1  FRISC  1 

By  the  beginning  of  1986,  with  tools  and  architecture  firmly  in  hand,  we  started  detailed 
design  of  a chip  that  implemented  most  of  our  ideas.  This  was  FRISC  1,  the  first  in  a 
series of chips that evolved into the SC32. We targeted the 4 µm Silicon on Sapphire
(SOS) process then available through MOSIS. We selected SOS technology for several
reasons. First, SOS is inherently immune to radiation induced latch-up and would thus be
a candidate technology for future integrated circuits used in flight systems. Second, the absence
of active-substrate junction capacitance reduces load and hence improves speed. Third, circuit
density is improved because there is no minimum p-active to n-active separation design rule.
Finally, on a more practical note, the SOS process was available through MOSIS at no
cost  as  far  as  the  project’s  budget  was  concerned.  So  the  chance  to  get  experience  with 
a technology  with  significant  benefits  for  chips  intended  for  use  in  space  was  too  good  to 
pass  up. 

Design of the 18,000 transistor chip was completed by mid-April 1986. It easily fit
inside a standard MOSIS 7.9 mm x 9.2 mm pad frame. We used caesar for layout, lyra
for design rule checking, rnl for functional simulation, spice for circuit simulation, and
the usual collection of customized shell scripts, format translators, and system utilities for
coordinating the design team’s work. While the chips were being fabricated, we built a
wire wrapped Multibus CPU board with memory and a programmable non-overlapping
clock generator board to test our parts.

Three months later we had our chips and began to test them. About half of the parts
that were eventually delivered appeared to function except that one data bit was always
stuck high. Unfortunately, that specific bit was used in the instruction set to cause the
processor to output a value, so we had no way to inspect the contents of the chip’s registers.
Microscopic analysis later revealed a spacing design rule violation at the interface between
the pad ring cell and the cell containing the chip’s interior logic. This error was undetected
because lyra flattened the layout of intersecting areas on adjacent cells after checking the
cells individually. Our design hierarchy consisted of the pad ring as one cell and all the
other circuitry in a second cell completely enclosed by the ring. Therefore the top level
rule check flattened the entire design and greatly exceeded the maximum virtual memory
space supported by our host workstation, so our mistake went undetected.

Despite this layout error, one chip was fully functional and we were able to demonstrate
a full Forth system running on our own custom 32-bit microprocessor. But before we could
submit a corrected design, MOSIS announced that they would no longer offer access to
SOS.


3.2 FRISC 2

At the beginning of 1987, we started to redesign our chip with the MOSIS scalable (3 µm
to 1.2 µm) bulk CMOS process. We also used the magic layout editor instead of caesar but
still depended on rnl for switch level simulation. By April we sent MOSIS the layout for
a 20,000 transistor chip that implemented almost all of our original architecture. The active
area for this chip, designed with 3 µm feature sizes, was slightly smaller than the previous
version but it still required a 7.9 mm x 9.2 mm pad frame. However, an inadvertently
grounded substrate prevented that part from working. Using a combination of infrared
microphotography and careful inspection of the layout in the hot region we eventually
located the error.² Since we made our mistake, a circuit extractor called mextra was
modified at the University of Washington to specifically detect similar errors. Apparently
we weren’t the first, and based on errors we’ve detected in other designs, not the last group
to make a substrate connection error.

² This error has since been missed by dozens of students taking the midterm exam in a JHU VLSI design
class.
A corrected layout was fabricated shortly thereafter and was fully functional. The fixed
FRISC 2 could execute about 2.5 million Forth primitives per second (about five times
faster than a 25 MHz Motorola MC68020 running Forth) and consumed 150 mW. However,
this performance was only about half of what we expected due to an incorrectly sized control
line driver.


4 The  SC32 

While  our  efforts  had  eventually  produced  a functional  and  usable  microprocessor,  we  did 
not  reach  our  design  goals  on  first  silicon.  In  fact,  we  felt  that  our  small  team  would  not 
be  able  to  build  chips  much  more  complex  than  FRISC  2 with  the  tools  and  workstations 
we  used  for  that  design.  Furthermore,  full  logical  and  parametric  functionality  would 
probably  be  achieved  only  after  several  fabrication  iterations.  Our  simulations  were  not  as 
thorough  as  we  would  have  liked  since  our  workstation  required  a day  to  complete  a switch 
level  simulation  of  the  execution  of  a few  machine  instructions.  Determining  the  impact 
of  more  than  one  or  two  architectural  alternatives  on  chip  speed  and  area  was  impractical. 
Irregular structures such as control logic were very tedious to lay out. Minor changes in
control logic would often result in days of work to resimulate and update the layout. As
our speed problem with FRISC 2 demonstrated, these structures were also a likely source
of parametric as well as functional errors.

4.1  Genesil 

Rather  than  waiting  several  years  for  workstation  speeds  to  improve  before  tackling  more 
complex  chip  designs,  we  investigated  commercial  VLSI  design  tools.  Silicon  Compilers 
Inc.  (now  part  of  Mentor  Graphics,  Inc.)  had  just  released  the  Genesil  silicon  compiler. 
This  was  a fully  integrated  set  of  VLSI  tools  that  let  the  user  describe,  implement,  and 
analyze  a design  at  the  block  diagram  level. 

Genesil’s  intended  market  was  logic  designers  with  no  VLSI  experience.  Yet  we  were 
attracted  to  it  because  the  compiler  allowed  a user  to  easily  and  quickly  investigate  the 
implications  of  many  architectural  alternatives.  We  felt  that  the  greatest  improvements 
in system performance could be gained by optimizing architecture, while lower level
enhancements would be of secondary importance. Any inefficiencies introduced by the high
level  design  tool  should  be  more  than  compensated  for  by  the  better  architecture  that  the 
silicon  compiler  would  allow  the  designer  to  develop.  Genesil  also  automated  many  of  the 
most  time  consuming  aspects  of  VLSI  design  so  a small  team  would  be  able  to  tackle  larger 
projects.  Thus  we  hoped  that  Genesil  would  be  the  better  tool  that  would  let  our  small 
team  tackle  larger  designs. 

4.2  SC32  Design 

Genesil  was  installed  at  our  site  by  June  1987,  and  we  started  using  it  to  explore  approaches 
to  implementing  our  Forth  architecture.  We  also  enhanced  our  computer’s  architecture 
based on the experience we gained on our earlier designs. The greater complexity that
Genesil  let  us  tackle  with  the  same  size  team  (2-3  part  time  people)  also  allowed  us  to 
improve  the  architecture.  By  mid-November  we  had  completed  our  Genesil  design  work 
including  thorough  simulations  of  thousands  of  instructions.  But  due  to  delays  in  design 
verification  at  Silicon  Compilers,  our  mask  level  design  wasn’t  delivered  to  the  foundry 
until  February,  1988.  After  an  extra  month  delay  caused  by  problems  with  test  vector 
formats,  we  received  fully  functional  tested  parts  in  May.  The  next  day  we  had  a single 
board  computer  running  an  interactive  Forth  development  system. 

We consider this third version of our Forth processor a complete success. It was
fabricated with a 2 µm epitaxial CMOS n-well process, contained 35,000 transistors, and
consumed 660 mW. The die was 9.9 mm x 9.6 mm and was packaged in an 84 pin ceramic
pin grid array. Despite obvious inefficiencies in the overall chip layout, the processor still
ran at 10 MHz. Because the processor architecture is optimized for Forth, even this comparatively
slow clock rate still delivered 8-12 million primitives per second, a throughput still
unmatched by any other 32-bit microprocessor implementation of the language of which
we are aware.


4.3  Architecture 

The  detailed  architecture  of  the  SC32  has  been  described  elsewhere. [2]  Briefly,  the  machine 
has  a 32  bit  word  address  architecture  and  an  instruction  set  that  can  implement  most 
Forth  primitives  in  a single  instruction.  Flow  control  instructions  specify  an  absolute 
destination  address  and  execute  in  a single  cycle  with  no  delay  slots.  The  machine’s 
register  set  is  organized  into  two  top-of-stack  caches  with  single  cycle  access  within  the 
instruction  set  to  the  top  four  locations  of  each  stack.  These  on-chip  caches  support 
stack  depths  limited  only  by  main  memory  with  overflow  and  underflow  events  handled 
entirely by hardware. Less than 1% overhead is added to typical Forth programs by our
approach to stack management. There are up to eight other utility and special purpose
registers allowed. The data path allows arithmetic operations between these registers to be
completed in a single cycle. A flexible load/store instruction format transfers data between
registers and memory and can also be used to form literal values.
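
The top-of-stack caching idea can be modeled in a few lines of Python. This is only an illustrative sketch: the actual SC32 cache organization is described in [2], and the `CachedStack` class, its default capacity, and its one-entry-at-a-time spill and fill policy are assumptions made here, not details taken from the chip.

```python
class CachedStack:
    """Stack whose top entries live in a small on-chip cache; the rest spill to memory."""

    def __init__(self, cache_size=4):
        self.cache_size = cache_size  # hypothetical capacity, not the SC32's actual value
        self.cache = []               # top-of-stack entries; last element is the top
        self.memory = []              # spilled entries held in main memory
        self.spills = 0               # memory writes caused by overflow
        self.fills = 0                # memory reads caused by underflow

    def push(self, value):
        if len(self.cache) == self.cache_size:
            # Overflow: move the deepest cached entry out to main memory.
            self.memory.append(self.cache.pop(0))
            self.spills += 1
        self.cache.append(value)

    def pop(self):
        if not self.cache:
            if not self.memory:
                raise IndexError("pop from empty stack")
            # Underflow: bring the most recently spilled entry back on chip.
            self.cache.append(self.memory.pop())
            self.fills += 1
        return self.cache.pop()

    def depth(self):
        return len(self.cache) + len(self.memory)
```

In this model, a program whose stack depth stays within the cache capacity incurs no spills or fills at all; memory traffic appears only when the depth crosses the cache boundary, while the total depth remains limited only by main memory.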

4.4  Performance 

Measuring and comparing processor performance is always controversial, especially for a
new  architecture  not  supported  by  commonly  used  languages.  Different  implementations  of 
Forth  are  also  difficult  to  compare  since  there  are  no  commonly  used  benchmark  programs 
written  in  that  language.  Finally,  it  is  only  natural  to  ask  how  a Forth  version  of  a program 
compares  to  an  equivalent  implementation  in  a more  widely  used  language. 

Since  Forth  is  the  only  high  level  language  available  for  the  SC32,  we  took  the  approach 
of  manually  translating  a set  of  small  integer  benchmark  programs  from  C to  Forth.  These 
programs  were  collected  by  the  Computer  Systems  Laboratory  at  Stanford  University  and 
have  since  been  translated  from  their  original  Pascal  into  C.  They  have  been  widely  used 
to evaluate the performance of many computer systems.

Because the Stanford programs are small, they are generally considered toy benchmarks
that provide overly optimistic results in comparison to similar tests made with
larger  codes.  But  several  factors  suggest  that  the  translated  benchmark  suite  will  provide 
a conservative  estimate  of  performance  running  large  Forth  programs. 

Merely  translating  these  programs  into  Forth  produced  very  poor  and  uncharacteristic 
Forth  code.  Word  definitions  were  extremely  long  and  difficult  to  debug.  This  meant 
that  the  SC32’s  efficient  call/return  mechanism  was  not  used.  However,  our  measurements 
showed  that  real  Forth  programs  greatly  benefit  from  this  feature  of  the  SC32.  Array  and 
structure accesses involved run time calculations repeated within inner loops and unnecessary
calculations were performed. There are many optimizations traditional compilers
perform  to  minimize  this  arithmetic.  Writing  the  equivalent  program  in  Forth  exposed 
these  excess  calculations  directly  to  the  programmer.  Thus  the  high  level  Forth  source 
code  would  normally  be  written  to  avoid  these  inefficiencies. 

Finally,  the  algorithms  and  data  structures  used  by  the  Stanford  programs  were  heavily 
influenced by traditional languages. A version of one of them, Towers of Hanoi, ran 9.6
times  faster  when  coded  with  data  structures  and  algorithms  better  suited  to  Forth  than 
the  simple  translation  of  the  original  code. 

The  SC32  running  with  a 10  MHz  clock  and  programmed  in  Forth  was  8.4  times  faster 
on the Stanford benchmarks than a Vax 11/780 programmed in C. If the multiplication
dominated intmm program is disregarded, then the SC32 is 9.9 times faster. The SC32 is
also 19.9 times faster than a 25 MHz Motorola MC68020 running Forth. If the MC68020
is programmed in C, then the SC32 is still 1.4 times faster.[3]
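
The ratios quoted above can be combined to compare the reference machines with one another. The short calculation below is only arithmetic on the figures in the text. It assumes that all three ratios were computed over the same benchmark set with the same averaging method, which the text does not state, so the derived numbers should be read as rough indications. The dictionary keys and the `relative_speed` helper are names introduced here for illustration.

```python
# Speed of the SC32 relative to each reference system, as quoted in the text.
SC32_SPEEDUP = {
    "VAX 11/780 (C)": 8.4,
    "MC68020 25 MHz (Forth)": 19.9,
    "MC68020 25 MHz (C)": 1.4,
}


def relative_speed(system_a, system_b, speedups=SC32_SPEEDUP):
    """Return how many times faster system_a is than system_b.

    If the SC32 is r_a times faster than A and r_b times faster than B,
    then A is r_b / r_a times faster than B.
    """
    return speedups[system_b] / speedups[system_a]
```

Under the stated assumption, `relative_speed("MC68020 25 MHz (C)", "MC68020 25 MHz (Forth)")` is about 14, which suggests how much a conventional microprocessor loses when running Forth and why an architecture tailored to the language matters.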

Our  goal  was  to  develop  a processor  that  could  deliver  the  benefits  of  an  interpreted 
programming  environment  without  any  performance  penalty.  The  data  we  have  collected 
show  that  this  goal  was  achieved.  Small  Forth  programs  run  at  least  as  fast  on  the  SC32 
as  equivalent  C programs  on  traditional  microprocessors.  Furthermore  it  is  likely  that 
this  relationship  will  become  more  favorable  for  large  programs  due  to  the  SC32’s  efficient 
call/return  mechanism. 

4.5  Applications  of  the  SC32 

Several  different  SC32  based  computers  have  been  built  at  APL.  A simple  single  board 
computer  was  designed  to  demonstrate  the  chip.  That  design  was  later  modified  and  used 
in  telemetry  decommutation  ground  support  equipment  for  the  TOPEX  and  SPINSAT 
radar  altimeter  satellites.  A standalone  computer  system,  including  operating  system  and 
utilities,  based  on  magnetic  bubble  memory  for  mass  storage  was  developed  to  show  the 
benefits  of  self  hosted  embedded  processors  for  the  NASA  Goddard  Space  Flight  Center. 
The  most  complex  SC32  system  we  have  built  is  a VME  bus  CPU  with  full  master/slave 
capability.  It  will  be  used  to  control  a balloon  borne  solar  magnetograph.  These  were 
interesting  projects,  but  it  was  not  until  1989  that  the  Freja  magnetometer  instrument 
gave  us  the  opportunity  to  use  one  of  our  chips  in  space  flight  hardware. 
Table 2: Freja Magnetometer Requirements Summary

• Anti-alias low pass filters for DC and AC channels
  - 64 Hz cutoff during normal rate (14.3 kbits/sec allocated to our instrument) telemetry operations
  - 128 Hz cutoff during high rate (28.7 kbits/sec allocated to us) telemetry operations
• Digitize X, Y, Z AC and DC magnetic field measurements to 16 bits
  - 128 samples/sec during normal rate telemetry operations
  - 256 samples/sec during high rate telemetry operations
• Oversample and average X, Y, and Z DC measurements
• Anti-alias filter one AC channel with 256 Hz cutoff and sample at 512 samples/sec
• Compute amplitude spectrum 0-256 Hz for the AC channel with 512 point FFT
• Detect magnetic activity to trigger data collection in other experiments
• Collect and digitize housekeeping and status data
• Format and output telemetry
• Interpret and execute commands
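
The spectral requirement in Table 2 has a convenient property: a 512 point FFT on data sampled at 512 samples/sec gives bins spaced 1 Hz apart, so bins 0 through 256 cover 0-256 Hz. The sketch below shows this with a plain recursive radix-2 FFT in Python. It is not the flight algorithm, which the text does not describe; the function names and the amplitude scaling convention are assumptions.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 decimation-in-time FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def amplitude_spectrum(samples, sample_rate):
    """Return (frequency_hz, amplitude) pairs for bins 0 through N/2."""
    n = len(samples)
    spectrum = fft(samples)
    bin_width = sample_rate / n
    result = []
    for k in range(n // 2 + 1):
        # Scale so a sine of amplitude A appears as A in its bin; the DC and
        # Nyquist bins are not doubled. This convention is an assumption.
        scale = 1.0 / n if k in (0, n // 2) else 2.0 / n
        result.append((k * bin_width, abs(spectrum[k]) * scale))
    return result
```

Feeding this a unit-amplitude 64 Hz sine sampled 512 times at 512 samples/sec yields 257 bins, with the peak in the bin labeled 64.0 Hz.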


5 The  Freja  Magnetic  Field  Experiment 

Freja  is  a Swedish  satellite  that  will  be  launched  into  a nearly  polar  orbit  to  study  the 
earth’s  magnetosphere  and  ionosphere.  Experiments  from  Sweden,  Germany,  and  Canada 
will  fly  on  the  satellite  and  the  U.S.  is  represented  by  a magnetic  field  experiment  designed 
and  built  at  APL.  Freja  is  clearly  an  international  effort  with  launch  scheduled  in  August 
1992  as  a “piggyback  payload”  on  a People’s  Republic  of  China  Long  March  rocket  (barring 
significant  changes  in  the  political  situation). 

5.1  Magnetometer  Requirements 

The  magnetometer  uses  the  SC32  to  implement  the  instrument’s  data  acquisition  and 
analysis system. Overall instrument requirements are summarized in Table 2.[4]
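
As a rough consistency check on Table 2, the raw data rate of the digitized channels can be compared with the telemetry allocation. The calculation below assumes that all six channels (X, Y, Z for both AC and DC) are sent as uncompressed 16-bit samples at the listed rate. The text does not say how the telemetry frame is actually laid out, so the function and constant names are illustrative only.

```python
def raw_sample_rate_bits(channels, bits_per_sample, samples_per_sec):
    """Bits per second needed to send every sample uncompressed."""
    return channels * bits_per_sample * samples_per_sec


CHANNELS = 6  # X, Y, Z for both AC and DC (assumes all six are telemetered)
BITS = 16

MODES = {
    # mode: (samples/sec per channel, allocated telemetry in bits/sec)
    "normal": (128, 14_300),
    "high": (256, 28_700),
}


def telemetry_headroom(mode):
    """Return (required_bps, allocated_bps, fraction of the allocation left over)."""
    rate, allocated = MODES[mode]
    required = raw_sample_rate_bits(CHANNELS, BITS, rate)
    return required, allocated, 1.0 - required / allocated
```

Under these assumptions the raw samples need 12,288 bits/sec in normal mode and 24,576 bits/sec in high rate mode, about 86% of the allocation in each case, leaving the remainder for everything else the instrument must report.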

The  conventional  approach  to  satisfying  these  requirements  would  include  a switchable 
hardware  anti-aliasing  filter  (for  the  two  different  sample  rates),  a 16-bit  A/D,  and  an  on 
board  computer  for  status  and  housekeeping  tasks.  The  processor  would  be  programmed  in 
its  assembly  language  and  the  code  would  be  cross-assembled  on  a separate  machine.  The 
object  code  would  be  downloaded  to  the  target  hardware  for  debugging  using  in-circuit 
emulators  and  other  support  equipment.  No  data  analysis  would  be  performed  on  the 
satellite  but  would  be  deferred  to  ground  based  postprocessing. 

This configuration was not feasible within the resources provided by Freja to our
magnetometer. There was neither enough power nor enough circuit board space for the switchable
filters. Filters would also seriously degrade the noise floor of the magnetic field
measurements. Telemetry bandwidth precluded transmission of the 512 samples/sec channel to
ground for spectral analysis. A separate digital signal processing device used to perform
this task would exceed the available power and board space. The extra hardware and
software design tasks would also have lengthened our development schedule. Finally, the
traditional approach to developing embedded computer software with cross-development
tools and in-circuit emulators was too costly due to the long edit, compile, download, and
test cycle.

Our magnetometer overcame these problems by using a simple fixed hardware
anti-aliasing filter, a 16-bit A/D converter, and the SC32 microprocessor. The computer
performs data acquisition and averaging, digital anti-alias filtering, FFT computation,
telemetry formatting, command interpretation and execution, and other instrument
control functions. Software development and debugging were performed interactively on the
actual target hardware in a high level language. Despite the processing demands imposed
by satisfying these requirements with software, the magnetometer processor has a 50%
throughput margin when the SC32 is driven at 40% of its maximum clock rate.
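
One of the software tasks listed above, oversampling and averaging, can be sketched as a simple block average followed by decimation. The text does not give the flight filter design or the oversampling factor, so the boxcar average and the `average_and_decimate` function below are assumptions used only to illustrate the idea of trading sample rate for lower noise in software.

```python
def average_and_decimate(samples, factor):
    """Average each consecutive group of `factor` samples into one output sample.

    Trailing samples that do not fill a complete group are discarded.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]
```

For example, averaging the readings `[11, 9, 11, 9, 11, 9, 11, 9]` in groups of four yields `[10.0, 10.0]`: the alternating disturbance cancels while the output rate drops by the same factor of four.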

Mass and power requirements were typical of small satellite experiments. The chassis
was milled from a solid block of magnesium rather than aluminum and circuit cards were
hardwired together instead of using cable assemblies. The completed instrument, excluding
probes and boom, weighed 3.5 kg. The entire instrument consumed less than 3.7 W
including DC-DC converter, sensor electronics, telemetry subsystem, and the computer
itself.


5.2  Instrument  Development 

Schedule  and  budget  constraints  were  also  quite  challenging.  The  flight  hardware  and 
software  were  delivered  to  Sweden  in  July  1991,  two  years  after  the  project  was  started. 
We  estimate  that  the  hardware  and  software  were  developed  for  50-75%  lower  cost  than  a 
system  of  equivalent  capability  based  on  a traditional  microprocessor  such  as  the  80C86RH. 
The  cost  savings  were  due  primarily  to  our  use  of  an  interactive  Forth  system  rather  than 
a cross-compiler/assembler that would be needed for the conventional processor. We also
have significant doubt that an equivalent instrument could be based on the 80C86RH due
to its limited throughput, even if it were programmed entirely in assembly language.

The  productive  software  development  environment  provided  by  the  SC32  was  a key 
factor  in  quickly  completing  the  instrument.  Forth’s  interactive  capability  greatly  assisted 
hardware debugging and subsystem integration. The flight code was extremely compact,
in  source  (2500  lines)  as  well  as  object  form  (16  Kwords  including  operating/development 
system).  Small  code  size  was  due  to  two  factors.  First,  our  real  time  scheduler  allowed 
the  program  to  be  organized  into  8 cooperating  tasks.  Each  task  was  simple  and  easily 
programmed  especially  when  compared  to  the  alternative  of  a single  monolithic  piece  of 
code.  Secondly,  Forth’s  extensibility  meant  that  program  size  grew  logarithmically  as 
complexity  increased.  Essentially  Forth  was  used  to  develop  a new  programming  language 
specifically  oriented  to  the  problem  domain.  Therefore  programs  that  solved  tasks  in  that 
domain were very compact. Because of these characteristics of Forth, one of us (Hayes)
wrote the magnetometer flight software in only two months. The magnetometer
has since been integrated with the other subsystems.
We were the first of the seven experiments on Freja to deliver fully flight-ready hardware
and software for satellite integration. No flight software changes have yet been needed.

5.3  Radiation  Testing 

Much of the development of our instrument was affected by considerations of the natural
radiation environment in Freja's 600 km x 1700 km high inclination orbit. During the two
year mission, it is expected to receive a total radiation dose of 12 kRad(Si). Radiation
induced latch-up and single event upset (SEU) soft errors also concerned us.

The CPU board contains the SC32, two 32 K x 32 RAM modules, EEPROM, 82C54RH timer
chips, and 52 SSI/MSI parts. The RAM and EEPROM were chosen because other APL flight
programs had determined that their radiation characteristics were acceptable in Freja's
orbit. The 82C54RH radiation tolerance was guaranteed by [...]. SSI/MSI logic from the
54AC00 family was used because it was known, again due to information from other APL
flight programs, to work in our environment. We had to establish the
radiation characteristics of the SC32 ourselves.

5.3.1  Total  Dose 

Parts from two fabrication lots were evaluated for total dose characteristics using our
in-house Co60 facility. Exposure was performed at a rate of 1 kRad(Si)/min with bias
applied to force the part into a known state. Current was
monitored during exposure. Component functionality was assessed within 1-2 min after
each radiation exposure using a standalone computer board executing SC32 diagnostics.
Testing required no more than five minutes after each exposure step, thus annealing effects
were minimized and the entire test was completed within an hour.

The first lot, obtained from our commercial licensee, was fully functional and within
parametric limits beyond 15 kRad(Si) for all five parts tested. The mean total dose
tolerance of these parts was 19.9 kRad(Si) with a variance of 4.8 kRad(Si). Full functionality
[...] after [...] at room temperature.

The other part lot was supplied directly by our foundry and had been packaged
according to Mil-Spec-883B. Our reliability group performed a pre-cap visual inspection of
these parts at the foundry and found their quality was excellent and that these parts could
easily be upgraded to higher reliability levels through APL's in-house testing and screening
procedures. Unfortunately, a process change to improve yield in the two years since the
first lot had been built degraded total dose tolerance. Three parts from this lot all failed
at or before 5 kRad(Si) when tested with the same procedures used with the first
lot. An additional three parts were exposed to 1 kRad with two days between subsequent
exposures to more nearly simulate the radiation environment of the Freja orbit. These
parts also failed at 5 kRad(Si). Room temperature unbiased annealing has only restored
functionality to two of these six parts.

Because  of  the  disappointing  total  dose  behavior  of  the  second  batch  of  parts,  we  were 
forced  to  obtain  our  flight  parts  from  the  first  lot.  Several  factors  allowed  us  to  upgrade 
these  commercial  parts  to  space  flight  quality.  The  positive  report  on  our  foundry’s  quality 
control  was  encouraging,  both  commercial  and  Mil-Spec  parts  were  packaged  in  the  same 
high  quality  ceramic  pin-grid  array  package,  and  all  lots  were  assembled  with  the  same 
equipment  and  personnel  at  the  foundry.  So  commercial  parts  from  the  first  lot  were 
extensively  screened  at  APL  and  passed  all  tests. 

5.3.2  Latch-up  and  SEU 

Radiation induced latch-up and SEU sensitivity of the flight part lot were also
evaluated. Initially, SC32 parts were screened for latch-up sensitivity in an in-house Cf252
chamber. This equipment exposed the die to heavy ions with a mean linear energy
transfer (LET) of 36 MeV-cm2/mg at a high flux rate. The SC32 did not latch during a 30
minute exposure. Subsequent work showed that many other chip types also did not latch
in the Cf252 chamber.

However,  later  tests  made  at  the  Single  Event  Upset  Test  Facility  of  the  Brookhaven 
National  Laboratory  Tandem  Van  de  Graaff  accelerator  cast  doubt  on  conclusions  about 
latch-up  sensitivity  based  on  Cf252  data.  Using  the  Brookhaven  equipment  we  were  able 
to measure both the radiation induced SEU and latch-up sensitivity of the SC32. The chip did
latch up with an LET threshold of 15.6 MeV-cm2/mg, which corresponds to about 1
latch-up per 21 years in the Freja orbit. An SEU threshold of 5 MeV-cm2/mg was also observed,
which was estimated to be equivalent to one soft error every 166 days in our orbit.
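
To put these two rates in perspective, the following minimal sketch converts them into an expected event count and a probability of at least one event over a mission. It assumes the events follow a Poisson process and assumes a mission length of two years; both are illustrative assumptions, and the function names are hypothetical.

```python
import math

def expected_events(mission_days, mean_days_between_events):
    """Expected number of events for a constant-rate (Poisson) process."""
    return mission_days / mean_days_between_events

def prob_at_least_one(mission_days, mean_days_between_events):
    """Probability of one or more events during the mission."""
    return 1.0 - math.exp(-expected_events(mission_days, mean_days_between_events))

# Assumed mission length (illustrative only).
MISSION_DAYS = 2 * 365.0

# Rates quoted in the text: one SEU per 166 days, one latch-up per 21 years.
seu_count = expected_events(MISSION_DAYS, 166.0)
latchup_probability = prob_at_least_one(MISSION_DAYS, 21 * 365.0)

print(f"expected SEUs over mission: {seu_count:.1f}")
print(f"probability of at least one latch-up: {latchup_probability:.3f}")
```

Under these assumptions a handful of soft errors are expected, while a latch-up is unlikely but not negligible, which is consistent with protecting against both.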

These  radiation  testing  results  led  us  to  add  latch-up  protection  circuitry  to  the  DC-DC 
converter.  If  excessive  current  is  drawn  by  the  SC32,  the  CPU  board  will  be  momentarily 
turned  off  thus  resetting  the  latched  circuitry.  After  power  is  restored  the  computer  will 
resume  normal  processing. 

SEU  events  are  more  difficult  to  detect  and  their  impact  can  be  more  subtle.  An  SEU 
could  disturb  the  program  controlling  the  processor  or  it  could  invalidate  a single  word 
of  science  data.  Because  an  SEU  is  only  expected  every  few  months,  it  represents  only 
a minor  error  in  the  collected  data  and  will  be  ignored.  Program  errors  will  be  detected 
by  a watchdog  timer  that  must  periodically  be  updated.  An  SEU  induced  program  error 
will  most  likely  be  detected  by  a failure  to  properly  access  the  watchdog.  In  response, 
the  watchdog  will  reboot  the  system.  Both  types  of  radiation  induced  error  should  occur 
rarely  enough  that  these  correction  strategies  will  not  significantly  degrade  the  quality  of 
the  magnetometer  data. 
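
The watchdog recovery strategy described above can be sketched as a small simulation. This is only an illustration of the general technique, not the flight software: the class, the tick-based timing, and the timeout value are all hypothetical.

```python
class Watchdog:
    """Counts ticks since the last update and requests a reboot on timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.reboots = 0

    def kick(self):
        # A healthy program updates the watchdog periodically.
        self.counter = 0

    def tick(self):
        # Returns True when the timeout expires and a reboot is triggered.
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.reboots += 1
            self.counter = 0
            return True
        return False


def run(total_ticks, upset_at, timeout_ticks=3):
    """Simulate a program that stops updating the watchdog after an upset."""
    watchdog = Watchdog(timeout_ticks)
    healthy = True
    for t in range(total_ticks):
        if t == upset_at:
            healthy = False      # simulated SEU corrupts the program
        if healthy:
            watchdog.kick()
        if watchdog.tick():
            healthy = True       # reboot restores normal processing
    return watchdog.reboots


print(run(total_ticks=20, upset_at=5))   # one upset, one reboot
print(run(total_ticks=20, upset_at=-1))  # no upset, no reboot
```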


6 Conclusions 

Because  the  SC32  was  originally  designed  as  a research  effort  and  was  only  manufactured 
by a commercial foundry, many questions had to be resolved before we could use it in a
space-based instrument. Reliability concerns were greatly reduced after a site visit to the foundry
showed  excellent  manufacturing  procedures  were  followed.  A thorough  screening  of  parts 
from  the  flight  lot  has  also  added  to  our  confidence  in  the  reliability  of  the  SC32. 

Radiation  tolerance  of  our  chip  was  also  studied.  Early  testing  of  our  prototype  chips 
indicated  they  would  meet  our  needs.  Commercial  versions  of  our  chip  manufactured 
shortly  thereafter  were  fully  evaluated  and  had  acceptable  radiation  tolerance.  However, 
the  foundry  modified  the  manufacturing  process  to  improve  yield  in  the  interval  between 
when our prototypes were evaluated and when we ordered Mil-Spec chips for our
instrument. This process change had the unfortunate side effect of diminishing total dose
tolerance to unacceptable levels. Unless a foundry rigorously controls those aspects of the
process  that  impact  radiation  tolerance,  performance  may  vary  significantly  between  lots. 

We have shown that a Forth language directed microprocessor with hardware and
software optimized for embedded systems can significantly improve spacecraft instrumentation.
Because of the capabilities of the magnetometer's computer based on the SC32, an
instrument of unprecedented capability was developed at far lower cost than could otherwise be
achieved. 

The  most  important  lesson  we  have  learned  from  this  work  is  that  a custom  integrated 
circuit  of  the  right  architecture  can  deliver  substantial  benefits  even  when  only  one  chip  is 
needed.  System  performance  that  is  unreachable  with  catalog  components  can  be  achieved 
and  qualification  issues  can  be  resolved.  Most  surprisingly,  system  development  costs  can 
be reduced by using custom chips. Savings from designing fewer circuit boards,
consuming less power, buying fewer expensive flight components, and most importantly greater
software  productivity  easily  balance  the  additional  costs  of  developing  and  qualifying  the 
right  custom  integrated  circuit. 
Multi-chip Modules:

A High-performance Packaging Alternative

L. Salmon

Brigham Young University

Abstract- Multi-chip Module (MCM) packaging has emerged as an important
technology for high-performance electronic systems. Benefits of MCMs include
high IC packing density, low interconnect propagation delay, excellent
power dissipation characteristics, and low cost. This paper will review MCM
substrate fabrication, testing, and design. Major challenges for MCM
implementation in high-performance systems will be discussed. Finally, applications
of MCM technology to current high-end computer systems will be reviewed.
New Dynamic FET Logic
and  Serial  Memory  Circuits 
for  VLSI  GaAs  Technology 

A.  G.  Eldin 

Electrical  Engineering  Department, 

The  University  of  Calgary, 

Calgary,  Alberta,  Canada. 

Abstract- The complexity of GaAs FET VLSI circuits is limited by the maximum
power dissipation while the uniformity of the device parameters determines the
functional yield. In this work, novel digital GaAs FET circuits are presented
that eliminate the dc power dissipation and reduce the area to 50% of that of
the conventional static circuits. Their larger tolerance to device parameter
variations results in higher functional yield.


1 Introduction 

GaAs  technology  is  used  in  the  fabrication  of  ultra  fast  digital  integrated  circuits.  The 
availability  of  such  circuits  is  critical  for  many  applications  such  as  Gigabit  communication 
systems  and  super  fast  computers  [1].  The  GaAs  FET  is  fundamentally  different  from  both 
the  MOSFET  and  the  bipolar  transistor.  Figure  1 highlights  these  differences. 


GaAs FET                    Bipolar                     MOSFET

Voltage driven              Current driven              Voltage driven

Low Rin                     Low Rin                     High Rin

Vin is clamped to           Vin is clamped, but IB      Vin is limited by the
(0.6 - 0.7) volts; Vin      is the driving signal       gate oxide breakdown
is the driving signal                                   voltage


Figure  1:  Fundamental  differences  between  devices 


Because of these differences, the bipolar and MOS logic families and circuit techniques
will be less successful if applied directly without modification to the GaAs technology.
Figure 2 shows the most commonly used static GaAs FET logic families [1].

Figure 2: Static GaAs FET logic families

In either of the two logic states, all these circuits dissipate dc power. This dc
component accounts for 90% of the total power [2]. The power dissipation limits the maximum
number of gates to 15,000 assuming a maximum allowable chip power of 5 watts [2]. On
the other hand, the ratio of the threshold voltage variation to the noise margin is critical
for determining the IC electrical yield. It is shown that if the threshold voltage variance
is changed from 90 mV to 150 mV, the circuit size should be reduced from 10,000 to 100
gates to maintain 50% yield [2]. This illustrates that the threshold voltage must be tightly
controlled for acceptable yield. In this paper, a novel circuit technique [3]-[6] is applied to
GaAs HFET technology to overcome these two main limitations. The new memory and
logic circuits do not dissipate any dc power, are less sensitive to threshold voltage variation
and have very small size.
2 The D-Type Flip Flop

An intermediate stage of a dynamic shift register is shown in Figure 3(a). The D-type flip
flop uses depletion type transistors with a threshold voltage of -0.7 volts. Figure 3(b) shows
the clock and input waveforms and the logic levels.


Figure 3: M/S dynamic D-type Flip Flop (intermediate stage): (a) master and slave sections, (b) waveforms

During t1, Phi2 = 0 volts. The master section is in the sample phase. The input data
is stored on the capacitor C1. It is charged to 2 volts or discharged to 0 volts for Vin
being logical 1 or 0 respectively. The capacitor C2 is precharged to approximately 3.5 volts
through J2 (Phi1 = -3.5 volts). The slave section is in the evaluation phase. Transistor
J3 is cut off and the drain voltage of J2, VD2 = 0 volts, thus providing a reference voltage
for evaluating the stored data on C3. If C3 is charged, J4 is turned off and the precharged
capacitor C4 retains its voltage to represent logic 1. However, if C3 is discharged, J4 is
turned on and C4 is discharged to represent logic 0. During t2, the roles of the master and
slave sections are interchanged. Figure 4 shows the output stage of the shift register.

The capacitor C4 is replaced by a pull up device for interfacing with static logic (DCFL).
The simulation results for the output stage are shown in Figure 5. The device model
accounts for the second order effects and is accurately calibrated to a 1 um HFET process.
Figure 5(a) shows that Vout is delayed by one clock period with respect to Vin, which verifies
the operation of the D-Type flip flop at 2 GHz. Figure 5(b) shows the waveforms VD1, VD2
and VD3, which correspond to the drain voltages of J1, J2 and J3 respectively.

Table  1 compares  the  dynamic  and  static  (DCFL)  implementations  of  the  D-Type  flip 
flop. 

Each section of the DCFL (M/S) Flip Flop uses two inverters for the static memory
cell, two depletion transistors (clocked transmission gates) and two DCFL super buffers,
as shown in Figure 6, to properly buffer the memory cell from the direct and capacitive
coupling caused by the clock signals driving the transmission gates. This DCFL
implementation requires 24 transistors of both depletion and enhancement types. Table 1 shows
Figure 4: M/S dynamic D-type Flip Flop (Output stage)

                      Dynamic FF         Static FF

Power                 0.4 mW             4 mW

Number of devices     4 transistors,     24 transistors
                      4 capacitors

Relative area         0.3                1

Noise margin (NM)     500 mV             200 mV

Table 1: Dynamic and static implementations of the D-Type flip flop

Figure 5: Voltage waveforms of the M/S D-Type Flip Flop

Figure 6: DCFL implementation of one section of the M/S D-Type Flip Flop

that the dynamic circuit has significant savings in both power dissipation and area. The
larger functional yield can be measured by the ratio of the threshold voltage variation to
the noise margin. Figure 5 shows that VD1 swings between -1.2 and 0 volts with a threshold
voltage of -0.7 volts. This relatively large noise margin makes the circuit operation less
sensitive to threshold voltage variations and results in a larger functional yield.
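
The savings claimed from Table 1 follow from simple ratios. The short sketch below recomputes them from the table values; the dictionary layout and variable names are only illustrative.

```python
# Values taken from Table 1 (dynamic versus static DCFL flip flop).
dynamic_ff = {"power_mw": 0.4, "relative_area": 0.3, "noise_margin_mv": 500.0}
static_ff = {"power_mw": 4.0, "relative_area": 1.0, "noise_margin_mv": 200.0}

# Fractional reductions relative to the static implementation.
power_reduction = 1.0 - dynamic_ff["power_mw"] / static_ff["power_mw"]
area_reduction = 1.0 - dynamic_ff["relative_area"] / static_ff["relative_area"]

# Improvement factor of the noise margin.
noise_margin_gain = dynamic_ff["noise_margin_mv"] / static_ff["noise_margin_mv"]

print(f"power reduction: {power_reduction:.0%}")
print(f"area reduction:  {area_reduction:.0%}")
print(f"noise margin:    {noise_margin_gain:.1f}x")
```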

3 Logic  circuits  implementation 

The  basic  dynamic  circuit  can  also  be  used  to  implement  the  AND,  OR,  and  complex  logic 
functions.  The  operation  of  the  basic  circuit  is  similar  to  that  of  the  dynamic  flip  flop. 
Figure  7 summarizes  the  operation. 
Figure 7: Basic dynamic logic circuit

Figure 8: The dynamic AND function

When  the  input  is  logic  0,  the  capacitor  C is  discharged  during  the  sampling  phase  and 
the  transistor  J2  will  turn  on  during  the  evaluation  phase.  Similarly,  if  the  input  is  logical 
1,  the  capacitor  is  charged  during  the  sampling  phase  causing  J2  to  turn  off  during  the 
evaluation  phase. 

An AND function is realized by connecting two cells in parallel as shown in Figure 8.
During the evaluation phase, the output will remain charged (representing logic 1) if both
J1 and J2 are turned off. This corresponds to A = B = 1 during the sampling phase. Any
other combination for the values of the inputs A and B will result in at least one of J1 or
J2 being turned on and causing the output to correspond to logic 0.

The OR function is implemented by connecting cells in series as shown in Figure 9.
During the evaluation phase, the output will be discharged only if both J1 and J2 are
turned on. This corresponds to A = B = 0 during the sampling phase. If A or B is logic
1, at least one of the transistors J1 or J2 will be turned off during the evaluation phase.
This causes the output to remain charged and correspond to logic 1.

Complex  logic  gates  can  be  realized  by  parallel  and  series  connections  of  the  basic 
circuit  as  shown  in  Figure  10.  In  this  example  the  output  F will  remain  charged  during 
the  evaluation  phase  if  either  (C  and  D)  are  logic  1 during  the  sampling  phase  or  (A  and 
B)  are  logic  1.  This  ensures  that  there  is  no  discharge  path  from  the  output  to  ground 
during  the  evaluation  phase. 
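
The behavior of these gates can be checked with a small switch-level model. The sketch below assumes only what the text states: a transistor conducts during evaluation when its sampled input was logic 0, parallel cells discharge the output if any branch conducts, and series cells discharge it only if every cell conducts. The exact topology of the complex gate (two parallel pairs in series) is an assumption made for illustration, and all function names are hypothetical.

```python
from itertools import product

def conducts(sampled_bit):
    """A cell's transistor turns on in evaluation if its capacitor was discharged (input 0)."""
    return sampled_bit == 0

def parallel(*branches):
    """A parallel network provides a discharge path if any branch conducts."""
    return any(branches)

def series(*branches):
    """A series network provides a discharge path only if every branch conducts."""
    return all(branches)

def output(discharge_path):
    """The precharged output stays at logic 1 unless a discharge path exists."""
    return 0 if discharge_path else 1

def dynamic_and(a, b):
    return output(parallel(conducts(a), conducts(b)))

def dynamic_or(a, b):
    return output(series(conducts(a), conducts(b)))

def dynamic_complex(a, b, c, d):
    # Assumed topology for F = AB + CD: (A parallel B) in series with (C parallel D).
    return output(series(parallel(conducts(a), conducts(b)),
                         parallel(conducts(c), conducts(d))))

def truth_table(gate, n_inputs):
    return {bits: gate(*bits) for bits in product((0, 1), repeat=n_inputs)}

print(truth_table(dynamic_and, 2))
print(truth_table(dynamic_or, 2))
```

Enumerating all input combinations confirms that the parallel connection yields AND, the series connection yields OR, and the combined network yields AB + CD.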

It  is  seen  that  these  logic  circuits  do  not  dissipate  any  dc  power.  Also,  since  only  one 
type  (depletion)  of  transistors  is  used,  the  circuits  are  less  sensitive  to  process  and  threshold 
voltage  variations.  It  is  noted  that  when  the  clocked  depletion  transistors  are  turned  on, 
VGS = 0 volts and the transistors do not draw any gate current. VGS can be increased
to  about  0.4  volts,  which  will  keep  the  gate  current  negligibly  small  while  increasing  the 
driving  capability  of  the  clocked  transistors  and  the  noise  margin  of  the  circuit.  This  will 
also  enhance  the  operating  speed. 
Figure 9: The dynamic OR function


F = AB  + CD 


Figure  10:  The  dynamic  Complex  Logic  gate 
4 Conclusions

The new dynamic circuits eliminate the dc power and have large noise margin and small
size. Compared to the static implementation using the DCFL, the use of the dynamic
circuits results in (50-70)% reduction in the area. The noise margin and therefore the
electrical functional yield is increased by a factor of 2.5 and the total power dissipation is
reduced by 90% at a switching speed of 2 GHz. These significant improvements allow an
order of magnitude higher level of integration with acceptable functional yield and power
dissipation.
Abstract - This paper presents an ILA architecture for synchronous sequential
circuits.  This  technique  utilizes  linear  algebra  to  produce  the  design  equations. 
The  ILA  realization  of  synchronous  sequential  logic  can  be  fully  automated 
with  a computer  program.  A programmable  design  procedure  is  proposed  to 
fulfill  the  design  task  and  layout  generation.  A software  algorithm  in  the  C 
language  has  been  developed  and  tested  to  generate  1 um  CMOS  layouts  using 
the  Hewlett-Packard  FUNGEN  module  generator  shell. 

1 Introduction 

The  design  of  sequential  circuits  presents  a major  task  for  most  digital  systems.  As  Very 
Large Scale Integrated (VLSI) technology advances, developing an architecture to
maximize the efficiencies of all the design steps becomes a major goal in the research of sequential
circuit design.

This paper introduces the Iterative Logic Array (ILA) as a new architecture for
synchronous sequential circuits. This architecture realizes a sequential circuit by replicating
simple basic modules. With an ILA architecture, a sequential machine can be built into a
very regular form automatically by a computer program with a single type of ILA
module. The simplicity and programmability of the ILA architecture significantly reduce the
design  task  in  all  stages  of  VLSI  implementation,  from  logic  design,  circuit  design,  artwork 
generation  to  verification. 

2 ILA  Architecture 

Iterative  Logic  Arrays  (ILA)  have  been  described  in  the  literature  for  quite  some  time  [1,2]. 
An  ILA  circuit  consists  of  an  array  of  identical  cells.  Generally,  as  shown  in  Figure  1,  each 
ILA cell contains two sets of input signals. One set of inputs is applied in parallel, while
the other set of inputs is driven by adjacent cells. Signals normally propagate in only
one  direction  between  cells,  and  outputs  are  derived  only  from  the  serial  outputs  of  the 
last  cell. 

In  an  ILA  architecture  for  sequential  circuits,  the  next  state  of  each  state  variable  is 
generated  by  a slice  of  concatenated  ILA  cells.  A sequential  network  is  then  constructed 
by  placing  the  ILA  slices  side  by  side. 
Figure 1: A slice of ILA circuit

Figure 2: Pass transistor 2-to-1 MUX

The basic cell of an ILA sequential network consists of a 2-to-1 multiplexer (MUX) and
a next state forming logic. A MUX cell has a select line S, its complement $\bar{S}$, two data
inputs I0 and I1, and a logic function defined by Equation 1.

$$Q = S I_1 + \bar{S} I_0 \qquad (1)$$

The simplest way to implement the MUX function is to use a pass transistor circuit.
Basically, the pass transistor MUX, excluding level restoration logic, is a module of two
pass transistors, which function as two simple switches. Figure 2 shows the circuit of
two inputs I1 and I0 and one output Q controlled by two control lines S and $\bar{S}$, which
are assumed to be asserted exclusively such that only one of the two inputs I1 and I0 can be
passed to Q at a given time.
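
As a quick check of the MUX function, the following sketch evaluates Q = S·I1 + S'·I0 on single bits, assuming that S = 1 selects I1. The function name `mux2` is hypothetical.

```python
def mux2(s, i1, i0):
    """2-to-1 multiplexer on single bits: passes i1 when s is 1, else i0."""
    not_s = 1 - s
    return (s & i1) | (not_s & i0)

# Exhaustive truth table: (S, I1, I0) -> Q
for s in (0, 1):
    for i1 in (0, 1):
        for i0 in (0, 1):
            print(s, i1, i0, "->", mux2(s, i1, i0))
```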

Some  details  in  pass  transistor  transmission  characteristics  are  omitted  here.  Design 
considerations,  such  as  level  restoration,  are  assumed  to  be  handled  by  the  output  buffers. 
The  circuit  design  considerations  have  been  discussed  in  [3,4,5]. 


3 Operational  Function 

In this research, the one-hot-code is utilized as the state assignment for a synchronous flow
table. With the one-hot-code assignment, there is a unique state variable corresponding
to each state. That makes it possible to express the design function using the states in
the flow table explicitly. A new form of mathematical expression is proposed next which
describes a flow table directly by flow table states.
Definition 1 The set of operational functions is the behavior description of a synchronous
flow table of n rows and m columns. Each function is an equation for a next state Si in
the flow table.

$$S_i = \sum_{p=1}^{m} S_{ip} I_p \qquad (2)$$

where Sip is an OR function of the states Sj, j = 1, ..., n, which have Si as the next entry
under input Ip.

It can be shown that there is a one-to-one mapping between the next state equation

$$Y_i = \sum_{p=1}^{m} f_{ip} I_p \qquad (3)$$

and the operational function. With the one-hot-code state assignment, each τ-partition
can be expressed as

$$\tau_i = \{S_i; \bar{S}\}$$

which partitions a single state Si from the remaining states $\bar{S}$ in the flow table. The number
of state variables is equal to the number of states. Next state η-partitions can be formed
using known procedures [6]. If an η-partition ηi is

$$\eta_i = \{S_1 S_2 \cdots S_l; \bar{S}\}$$

then it is well known that

$$f_{ip} = y_1 + y_2 + \cdots + y_l.$$

On the other hand, Equation 2

$$S_i = \sum_{p=1}^{m} S_{ip} I_p$$

can be mapped into a next state equation as Equation 3 if the one-hot-code assignment
is used, where fip is the sum of the state variables yj corresponding to Sip in Equation 2.
Therefore, there is a one-to-one mapping between Equation 2 and Equation 3.

Since  the  operational  function  is  a direct  representation  of  the  flow  table,  they  can  be 
derived  by  inspection.  For  each  state  in  the  next  state  entry,  there  is  a product  term  of 
the  present  state  and  input  state  in  the  operational  function.  If  a synchronous  machine 
is  specified  by  a state  diagram,  the  state  diagram  may  need  to  be  converted  to  a flow 
table,  though  it  will  not  be  too  hard  for  an  experienced  designer  to  derive  the  operational 
functions  from  the  state  diagram  directly. 

Table 1 is the flow table of a state machine with four states. For example, State Sa
appears as the next state entry of states Sb, Sc and Sd under I1. Therefore, the operational
function for Sa is

$$S_a = (S_b + S_c + S_d) I_1.$$

State Sb appears as the next state under both I1 and I2. So the operational
function for state Sb is

$$S_b = S_a I_1 + (S_b + S_c) I_2.$$
Table 1: A synchronous flow table for the state diagram

The  operational  functions  for  state  Sc  and  Sd  can  be  derived  in  the  same  way.  All 
together,  the  operational  functions  for  Table  1 are  as  follows: 

$$\begin{aligned}
S_a &= (S_b + S_c + S_d) I_1 \\
S_b &= S_a I_1 + (S_b + S_c) I_2 \\
S_c &= (S_a + S_b) I_3 \\
S_d &= (S_a + S_d) I_2 + (S_c + S_d) I_3
\end{aligned} \qquad (4)$$
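
Deriving operational functions by inspection amounts to grouping, for every next state, the present states that lead to it under each input. The sketch below does this mechanically. The flow table entries used here are an assumption: the table body is not reproduced in this text, so the entries were inferred from the equations around it and should be treated as illustrative. The function names are hypothetical.

```python
# Assumed flow table (inferred, not taken from Table 1 itself):
# FLOW_TABLE[present_state][input] = next_state
FLOW_TABLE = {
    "Sa": {"I1": "Sb", "I2": "Sd", "I3": "Sc"},
    "Sb": {"I1": "Sa", "I2": "Sb", "I3": "Sc"},
    "Sc": {"I1": "Sa", "I2": "Sb", "I3": "Sd"},
    "Sd": {"I1": "Sa", "I2": "Sd", "I3": "Sd"},
}

def operational_functions(table):
    """Map each next state to {input: [present states that lead to it]}."""
    funcs = {}
    for present, row in table.items():
        for inp, nxt in row.items():
            funcs.setdefault(nxt, {}).setdefault(inp, []).append(present)
    return funcs

def render(funcs):
    """Format each operational function as a sum of (OR of states) x input."""
    lines = []
    for nxt in sorted(funcs):
        terms = []
        for inp in sorted(funcs[nxt]):
            preds = funcs[nxt][inp]
            group = " + ".join(preds)
            if len(preds) > 1:
                group = "(" + group + ")"
            terms.append(group + inp)
        lines.append(nxt + " = " + " + ".join(terms))
    return lines

for line in render(operational_functions(FLOW_TABLE)):
    print(line)
```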

4 ILA Architecture for Synchronous Sequential Circuits

A simple  regular  ILA  structure  requires: 

• The design equation is convertible to a pass logic function where each control variable
passes a single pass variable or a constant.

• The  control  variables  are  shared  with  each  pass  logic  function. 

With  such  a structure,  if  the  pass  variables  in  each  equation  are  the  same,  the  signal  bus 
to  each  slice  of  ILA  circuit  can  be  minimized  to  a single  wire. 

If state Si is used as the control and Si appears as a next state under only one input
Ip, then Ip can be the only pass variable in the design equation for Si. For example, the
equation for Sa in Equation 4 can be converted into a pass logic function with input I1 as
a pass variable:

$$S_a = S_b(I_1) + S_c(I_1) + S_d(I_1)$$

From the definition of the operational function in Equation 2, if Si appears only under
input Ip, then Equation 2 can be rewritten as:

$$S_i = S_{ip} I_p \qquad (5)$$

where Sip is an OR function of the states Sk, k ∈ {1, 2, ..., n}. Therefore, Equation 5 can
be written into:

$$S_i = I_p \sum_{k=1}^{n} g_{ik} S_k \qquad (6)$$
Figure 3: The general ILA structure for synchronous logic


or in a form of pass logic expression:

$$S_i = S_1(g_{i1} I_p) + \bar{S}_1(\cdots S_k(g_{ik} I_p) + \bar{S}_k(\cdots (S_n(g_{in} I_p) + \bar{S}_n(0)) \cdots)) \qquad (7)$$

where

$$g_{ik} = \begin{cases} 1 & \text{if } S_i \text{ is the next state of } S_k \text{ under } I_p \\ 0 & \text{if } S_i \text{ is not the next state of } S_k \text{ under } I_p \end{cases}$$


Theorem 1 The architecture depicted in Figure 3 is a proper model for a synchronous
sequential circuit.


Proof: The proof follows directly from the one-hot-code assignment, in which one and only
one state variable is active at a time, and from the fact that Equation 6 contains only one
input state. Clearly, the architecture realizes Equation 6 by placing a multiplexer under Sk
where gik = 1 and a wire under Sk where gik = 0.

□ 
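
The argument of the proof can be checked numerically. The sketch below evaluates the sum form of Equation 6 and the nested pass logic form of Equation 7 for every one-hot present state and every choice of the coefficients gik, and the two should agree. It is an illustration only; the function names and the list-based encoding of the states are assumptions.

```python
from itertools import product

def next_state_sum(i_p, g_row, state):
    """Equation 6: S_i = I_p * OR_k (g_ik * S_k), on single bits."""
    return i_p & int(any(g & s for g, s in zip(g_row, state)))

def next_state_pass_chain(i_p, g_row, state):
    """Equation 7: a chain of 2-to-1 MUXes, evaluated from the innermost term outward."""
    value = 0  # the innermost constant 0
    for g, s in reversed(list(zip(g_row, state))):
        # S_k selects (g_ik * I_p); otherwise the value from the rest of the chain passes.
        value = (g & i_p) if s else value
    return value

def one_hot_states(n):
    """All n-bit states with exactly one active state variable."""
    return [[1 if k == active else 0 for k in range(n)] for active in range(n)]

n = 4
mismatches = 0
for g_row in product((0, 1), repeat=n):
    for state in one_hot_states(n):
        for i_p in (0, 1):
            if next_state_sum(i_p, g_row, state) != next_state_pass_chain(i_p, g_row, state):
                mismatches += 1

print("mismatches:", mismatches)
```

The agreement depends on the one-hot assumption: with more than one active state variable the chain only sees the first selected cell, while the sum form sees all of them.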

To  accomplish  the  ILA  structure,  5^  must  be  restricted  to  appear  in  a flow  table  under 
only  one  input  Ip.  If  a state  Si  appears  as  a next  state  under  both  Ip  and  J9,  Si  has  to  be 
split  into  two  different  states.  For  example,  in  Table  1,  state  Si  appears  under  both  input 
Ii  and  I2.  It  is  necessary  to  distinguish  5 & with  two  unique  states  Sbi  and  Sj>2  where  Sbi 
represents  the  under  I\  and  Sb2  represents  the  Sb  under  I2.  Similarly,  state  Sd  needs  to 
be  split  into  S^2  and  Sj3.  A revised  flow  table  can  then  be  obtained  by  splitting  all  states 
under  different  columns.  Table  2 shows  the  result. 

After updating the flow table, the operational function for each state can be derived
in the same way as before. For example, $S_{b2}$ is the state under $I_2$ only. Therefore, its
operational function is

$$S_{b2} = 0 + S_{b1} I_2 + S_{b2} I_2 + S_c I_2 + 0 + 0.$$




            I1      I2      I3
Sa          Sb1     Sd2     Sc
Sb1         Sa      Sb2     Sc
Sb2         Sa      Sb2     Sc
Sc          Sa      Sb2     Sd3
Sd2         Sa      Sd2     Sd3
Sd3         Sa      Sd2     Sd3

Table 2: A revised flow table

All other operational functions are also in the same form. The results are as follows:

$$\begin{aligned}
S_a &= 0 + S_{b1} I_1 + S_{b2} I_1 + S_c I_1 + S_{d2} I_1 + S_{d3} I_1 \\
S_{b1} &= S_a I_1 + 0 + 0 + 0 + 0 + 0 \\
S_{b2} &= 0 + S_{b1} I_2 + S_{b2} I_2 + S_c I_2 + 0 + 0 \\
S_c &= S_a I_3 + S_{b1} I_3 + S_{b2} I_3 + 0 + 0 + 0 \\
S_{d2} &= S_a I_2 + 0 + 0 + 0 + S_{d2} I_2 + S_{d3} I_2 \\
S_{d3} &= 0 + 0 + 0 + S_c I_3 + S_{d2} I_3 + S_{d3} I_3
\end{aligned}$$

Splitting states in a flow table allows all of the pass variables in an operational function
to be the same. The disadvantage of splitting states is that it generates additional next
state equations. Increasing the number of equations implies increasing the area in silicon.
It is a trade-off between the programmability and regularity gained from the ILA realization
and the cost. An automated sequential circuit design will significantly reduce the design effort and
speed up the process of implementation.

5 The  Matrix  Expression 

The operational functions discussed in previous sections can be efficiently expressed with
matrices. The matrix will also help to implement the function in silicon. With Equation 6,
a synchronous sequential circuit can be expressed with a set of equations:

$$\begin{aligned}
S_1 &= I_p \sum_{k=1}^{n} g_{1k} S_k \\
&\;\;\vdots \\
S_n &= I_q \sum_{k=1}^{n} g_{nk} S_k
\end{aligned}$$

Such a set of equations is equivalent to a matrix expression:

$$S = A \times G \times S$$




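where the two vectors $S$ (the next states on the left-hand side and the present states on the
right-hand side) are column vectors

$$S = \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_n \end{pmatrix},$$

matrix $A$ is a diagonal matrix with $I_p$ in the $i$th row/column if the next state $S_i$ is under $I_p$,

$$A = \begin{pmatrix} I_p & & 0 \\ & \ddots & \\ 0 & & I_q \end{pmatrix},$$

and matrix $G$ is defined as

$$G = \begin{pmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & & \vdots \\ g_{n1} & \cdots & g_{nn} \end{pmatrix}$$

in which

$$g_{ik} = \begin{cases} 1 & \text{if } S_i \text{ is the next state of } S_k \\ 0 & \text{if } S_i \text{ is not the next state of } S_k \end{cases}$$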


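For example, the matrix expression for the flow table in Table 2 is:

$$\begin{pmatrix} S_a \\ S_{b1} \\ S_{b2} \\ S_c \\ S_{d2} \\ S_{d3} \end{pmatrix} =
\begin{pmatrix}
I_1 & 0 & 0 & 0 & 0 & 0 \\
0 & I_1 & 0 & 0 & 0 & 0 \\
0 & 0 & I_2 & 0 & 0 & 0 \\
0 & 0 & 0 & I_3 & 0 & 0 \\
0 & 0 & 0 & 0 & I_2 & 0 \\
0 & 0 & 0 & 0 & 0 & I_3
\end{pmatrix} \times
\begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix} \times
\begin{pmatrix} S_a \\ S_{b1} \\ S_{b2} \\ S_c \\ S_{d2} \\ S_{d3} \end{pmatrix} \qquad (9)$$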
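A matrix expression of this kind can be evaluated directly with Boolean arithmetic. The sketch below is an illustration under stated assumptions: the diagonal of A and the rows of G are the values implied by the operational functions of the revised flow table, the helper names are hypothetical, and the active input is modelled as a single label rather than as separate signal lines.

```python
STATES = ["Sa", "Sb1", "Sb2", "Sc", "Sd2", "Sd3"]
A_DIAG = ["I1", "I1", "I2", "I3", "I2", "I3"]   # diagonal of matrix A
G = [
    [0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

def one_hot(name):
    """One-hot present-state vector for the named state."""
    return [int(s == name) for s in STATES]

def decode(vector):
    """Names of the states whose bit is set."""
    return [s for s, bit in zip(STATES, vector) if bit]

def next_state(present, active_input):
    """Evaluate S = A x G x S with AND as product and OR as sum."""
    nxt = []
    for a_ii, g_row in zip(A_DIAG, G):
        g_dot_s = any(g and s for g, s in zip(g_row, present))  # row of G x S
        nxt.append(int(a_ii == active_input and g_dot_s))       # gated by a_ii
    return nxt
```

For instance, `decode(next_state(one_hot("Sa"), "I1"))` gives `["Sb1"]`, and every combination of a one-hot present state and an input yields exactly one active next state.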


Matrices A and G are directly related to the hardware structure. As in the ILA realization,
there will be a slice of the ILA circuit for each design equation, as shown in Figure 3.
Each element on the diagonal of matrix A indicates the input state to the ILA
slice. Each row of matrix G reveals the location of ILA cells in the slice. If the element
$g_{ik}$ is 1, then an ILA cell will be placed under the control of state $S_k$ in the slice of the
ILA circuit for next state $S_i$. If $g_{ik}$ is equal to 0, a wire will be placed in that position. An
example of the ILA realization will be shown in the next section.

6 Design Procedure

From the discussion in the previous section, the design of a synchronous machine can be
completely automated by programming an ILA cell or a wire into a pre-interconnected
layout floor. This allows the physical layout to be designed and stored in a computer as a set
of building blocks. Then, for each instance of synchronous sequential logic, an ILA circuit
can be implemented by placing ILA cells according to the corresponding G matrix. The
A matrix will indicate the interconnection to input states.

From the layout point of view, a wire can be considered as a cell as well. Hence
there will be two cell types in an ILA realization. Let the ILA cell which performs the
multiplexer function be defined as a MUX cell. Let a wire be defined as an ILA ZERO
cell. Now the computer program can search the G matrix and place a MUX cell once a
“1” is encountered or a ZERO cell once a “0” is encountered. The schematics of a MUX
cell and a ZERO cell are shown in Figure 4 (b) and (c) respectively.

Procedure  1 Synchronous  ILA  network  design  procedure. 

Step 1. For a synchronous machine specified by a state diagram, convert it into a synchronous
flow table (state table).

Step 2. If a state appears as a next state under more than one input column, split the
state and give a unique name to the state under each input column. Repeat this step
until all states under one column are distinguished from states under other columns.

Step 3. Generate the A matrix by setting the diagonal element in the ith column to be $I_p$
if state $S_i$ appears as a next state in the flow table under $I_p$.

Step 4. Generate the G matrix such that $g_{ij}$ is “1” if $S_i$ is the next state of $S_j$, and $g_{ij} = 0$
otherwise.

Step 5. Map the matrices to the layout floor. Place a MUX cell under the control of $S_k$
in the slice of the ILA circuit for the next state $S_i$ if $g_{ik} = 1$, or place a ZERO cell if
$g_{ik} = 0$.

Step 6. Connect Input-1 of the last ILA cell in the slice of the ILA circuit for $S_i$ to $I_p$,
which is the diagonal element of matrix A in the ith row. Connect Input-0 of the last
ILA cell to logic low (VSS).
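Steps 3 to 5 of the procedure can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the flow table literal is an assumption inferred from the operational functions of the revised flow table. The sketch assumes that Step 2 has already been applied, and it raises an error if a state does not appear as a next state under exactly one input column.

```python
# Assumed revised flow table: {present_state: {input: next_state}}.
REVISED_TABLE = {
    "Sa":  {"I1": "Sb1", "I2": "Sd2", "I3": "Sc"},
    "Sb1": {"I1": "Sa",  "I2": "Sb2", "I3": "Sc"},
    "Sb2": {"I1": "Sa",  "I2": "Sb2", "I3": "Sc"},
    "Sc":  {"I1": "Sa",  "I2": "Sb2", "I3": "Sd3"},
    "Sd2": {"I1": "Sa",  "I2": "Sd2", "I3": "Sd3"},
    "Sd3": {"I1": "Sa",  "I2": "Sd2", "I3": "Sd3"},
}

def build_matrices(table):
    """Return (states, diagonal of A, G) for a flow table with split states."""
    states = list(table)
    a_diag, g = [], []
    for target in states:
        # Step 3: the single input column under which `target` is a next state.
        cols = {inp for row in table.values() for inp, nxt in row.items() if nxt == target}
        if len(cols) != 1:
            raise ValueError(f"{target} must appear under exactly one input, found {sorted(cols)}")
        a_diag.append(cols.pop())
        # Step 4: g_ij = 1 if `target` is a next state of states[j].
        g.append([int(target in table[src].values()) for src in states])
    return states, a_diag, g

def place_cells(g):
    """Step 5: a MUX cell for every 1 and a ZERO cell (wire) for every 0."""
    return [["MUX" if bit else "ZERO" for bit in row] for row in g]

STATES, A_DIAG, G = build_matrices(REVISED_TABLE)
CELLS = place_cells(G)
```

For the slice of $S_a$, `CELLS[0]` is one ZERO cell followed by five MUX cells, and `A_DIAG[0]` is `"I1"`, the input that Step 6 ties to Input-1 of the last cell in that slice.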

For example, for a synchronous machine specified by the flow table shown in Table 1, it is
necessary to find those states which appear under more than one input state and to split them.
The result of splitting is shown in Table 2. The matrices of the flow table can then be
generated. For instance, $S_a$ is a next state under $I_1$ of states $S_{b1}$, $S_{b2}$, $S_c$, $S_{d2}$ and $S_{d3}$. Then
$I_1$ becomes the diagonal element $a_{11}$ in matrix A; the first row of matrix G will have a “0”
in the first column, since the next state of $S_a$ under $I_1$ is not $S_a$, and a “1” in the rest
of the columns. The A matrix and G matrix can then be mapped into an ILA network. The
result is shown in Figure 4 (a), where each ILA cell is represented by a box. The boxes in
dashed lines represent ZERO cells ($g_{ij} = 0$) and the boxes in solid lines represent MUX cells
($g_{ij} = 1$). As the first row of matrix G is “011111”, the top slice of the ILA for $S_a$ consists
of one ZERO cell on the left and five MUX cells. Again, from matrix A, the inputs of the
last ILA cell are tied to $I_1$ and VSS.

Figure 4: The ILA network for the example. (a) Synchronous ILA network; (b) ILA cell - mux; (c) ILA cell - zero

As  mentioned  before,  a major  advantage  of  the  design  approach  in  Procedure  1 is  that 
it  allows  a hierarchical  layout  design.  The  high  level  layout,  including  interconnections,  is 
identical  for  all  synchronous  flow  tables.  When  the  function  of  a flow  table  changes,  the 
only  thing  one  has  to  do  is  to  instruct  the  computer  to  re-program  the  position  of  MUX 
cell  and  ZERO  cell.  Of  course,  the  input  state  to  each  slice  of  the  ILA  may  need  to  be 
changed  as  well. 

7 Automated  Synchronous  ILA  Design  System 

The ILA design procedure has been coded into computer programs and ported to the Hewlett-
Packard FUNGEN layout tool. The automatic synchronous ILA design system consists of
an HP FUNGEN shell and three major subsystems:

• Sequential  Logic  Processor 

• FUNGEN  Configuration  Code 

• Library  of  Layout  Building  Blocks. 

The Sequential Logic Processor is an ILA circuit topology generator which receives the
specification of a synchronous sequential machine and converts it into a form specified by
the FUNGEN Configuration Code. There are three phases in implementing the Sequential
Logic Processor: flow table revising, matrices generation and FGNRC formation. The first
two phases closely follow step 2, step 3 and step 4 in Procedure 1. The third phase
generates the parameters of the device modules pre-defined by the FUNGEN Configuration Code
and writes them into an FGNRC file. By modifying the last phase, the program can be
ported to any other artwork generator system.

The FUNGEN Configuration Code describes the artwork architecture and defines the
modules in the FGNRC file. The FUNGEN Configuration Code is written in Fungen
Configuration Language (FCL), a subset of the C language with a number of functions for
Hewlett-Packard TRANTOR database generation. The overall ILA architecture and a set
of ILA configuration modules are specified in the FUNGEN Configuration Code.

When running the FUNGEN shell, the system invokes the FUNGEN Configuration
Code, the FGNRC file and the Layout Library, and automatically generates the layout artwork
by placing pre-designed ILA cells and peripheral buffers. It also labels all of the blocks in
accordance with the FUNGEN Configuration Code and the FGNRC file. Figure 5 illustrates
the block diagram of the ILA design system and the algorithm of the Sequential Logic Processor
implementation.


Figure 5: Block diagram of the automatic ILA design system

8 Summary

This paper presents an ILA architecture for synchronous sequential circuits. A design
procedure is also proposed to realize synchronous sequential ILA circuits by programming
the placement of two basic cells, a 2-to-1 multiplexer or a cell of metal wires. The
interconnection between ILA cells is only a single route line in both the X and Y dimensions.
The simplicity and programmability of the procedure significantly reduce the effort in all
stages of synchronous sequential circuit implementation, from logic design, circuit design
and physical layout to verification.

The ILA design procedure utilizes matrix expressions to represent design equations.
One of the advantages of using matrices is that they directly indicate the placement of the
ILA cells in the realization. An ILA design tool for synchronous sequential circuits has
been implemented as a computer system which automatically generates layout artwork
from a synchronous sequential machine specification.

Abstract: This paper presents a single phase dynamic CMOS NOR-NOR PLA using
triggered decoders and charge sharing techniques for high speed and low power.
By using the triggered decoder technique, the ground switches are eliminated,
thereby making this new design much faster and lower in power dissipation than
conventional PLAs. By using the charge-sharing technique in a dynamic CMOS
NOR structure, a cascading AND gate can be implemented. The proposed
PLAs are presented with delay-times of 15.95 and 18.05 nsec, respectively,
compared with a delay-time of 35.5 nsec for a conventional single phase PLA.
For a typical example of a PLA like the Signetics 82S100 with 16 inputs, 48 input
minterms (m) and 8 output minterms (n), the 2-SOP PLA using the triggered
2-bit decoder is 2.23 times faster and has 2.1 times less power dissipation
than the conventional PLA. These results are simulated using a maximum drain
current of 600 μA, a gate length of 2.0 μm, a Vdd of 5 V, an input minterm
capacitance of 1500 fF, and an output minterm capacitance of 1500 fF.

1 Introduction 

CMOS  technology  has  become  a vital  technology  for  VLSI  because  of  its  high  density  and 
low  power  dissipation.  However,  it  suffers  from  low  speed  due  to  its  inherent  parasitic 
capacitance.  Thus  high-speed  CMOS  techniques  have  been  vigorously  researched.  With 
high  speed  CMOS  it  would  be  possible  to  implement  real-time  digital  signal-processing 
applications.  As  a result,  an  advanced  CMOS  technology  is  expected  to  remain  at  the 
forefront  of  VLSI  technology  for  many  years  to  come.  The  problem  in  designing  VLSI 
systems  is  of  enormous  complexity.  This  problem  can  be  partially  simplified  by  using  the 
more  general  design  principle  of  a PLA  which  provides  a regular  structure.  PLAs  are 
also  attractive  to  the  VLSI  designer  because  their  structure  requires  a minimum  number 
of  separate  cell  designs,  and  allows  for  ease  in  testing  while  offering  the  opportunity  for 
simple,  rapid  expandability  [2]. 

The delay-time of a dynamic CMOS NOR gate increases very slowly with an increasing
number of inputs [7]. By cascading two stages of multi-input NOR gates, any desired
Boolean function of the input variables can be generated. A CMOS dynamic PLA makes
use of these two properties. In CMOS technology the use of PLAs has continued, mainly
for regularity of layout and ease of code modification [6]. PLA-based nMOS processors
are superior in power dissipation to random-logic based machines by a significant margin,
although they are slightly larger in area [1]. Dynamic CMOS PLAs are often used, rather
than nMOS PLAs, to generate state vectors for lower power microprocessors. However,
the use of ground switches and multiple phases reduces the speed of PLAs using conventional
dynamic CMOS technology [7,8].

The fastest conventional single phase dynamic CMOS NOT-NOR-NOR-NOT PLA has
large noise spikes at the floating nodes [3,4]. Hence, a four phase dynamic CMOS NOT-
NOR-NOR-NOT PLA [8] is commonly adopted. The problem with this four phase scheme
is that complex clocks must be generated to drive the dynamic logic circuits. This also
reduces speed and requires a larger interconnection area. A low-power NAND-NOT-NOR-NOT
PLA using a simplified addressing scheme is proposed in [4], with 244 ns per instruction
for a signal processor.

In this paper, a family of PLAs using triggered decoders and charge sharing techniques
is proposed. They are single phase dynamic CMOS NOT-NOR-NOR-NOT PLAs in sum
of products (SOP) form, called SOP PLAs, using 1-bit and 2-bit triggered decoders, respectively.
By using the charge sharing technique for the implementation of the cascaded AND array in
NOR structures, and the triggered input technique for the deletion of ground switches,
these PLAs are faster and require lower power dissipation than the conventional single
phase dynamic CMOS NOT-NOR-NOR-NOT PLA. By using triggered 2-bit decoders on
the input during the precharge time, the capacitance of an input minterm of a PLA can be
minimized to reduce power [5,6]. Therefore, it is possible to make a faster PLA employing
the triggered input decoder circuits and the charge sharing technique.


2 Dynamic  Single  Phase  CMOS  PLA 

CMOS PLA operations may be divided into two classes: pseudo-nMOS and dynamic
CMOS. Advantages of the pseudo-nMOS PLAs include simplicity and small area.
Disadvantages are due to the static power dissipation. The dynamic CMOS PLAs generate less
power and ground noise than the pseudo-nMOS PLAs. The pseudo-nMOS PLAs are faster
than dynamic CMOS PLAs, but for large PLA layouts the power dissipation is excessive,
thus forcing the designer to go to dynamic CMOS.

A modified schematic of a conventional single phase dynamic CMOS NOT-NOR-NOR-
NOT PLA [3] is shown in Figure 1. This PLA is known to be the fastest [4], but at the
expense of wasted power. The ground switches are charged and discharged
every cycle. They connect the sources of all the AND array input transistors, and for
layout compactness are built in diffusion. This results in a high capacitance of the order of
tens of picofarads. For larger minterms, the additional capacitance of the ground switches
is significant.


2.1  Triggered  Input  Logic  for  Dynamic  CMOS  Logic 

Dynamic CMOS circuits have higher speed and lower chip area than conventional
static CMOS logic. However, they can only implement noninverting functions. This is because
every Domino CMOS gate has to be followed by an inverter, and the output of a dynamic
CMOS gate cannot be fed directly to another gate. It is possible to avoid this problem by
using a triggered 1-bit decoder.

Let us assume that every dynamic triggered 1-bit signal (a, a') is reset to (0,0) during
the precharge period. During the evaluate period, a two-valued “one” signal A is represented
as the triggered 1-bit decoder signal (1,0), and the corresponding two-valued “zero” signal
as (0,1). In this way, a two-valued static signal A can be represented as
a triggered 1-bit decoder signal (a, a') using a two-variable two-valued signal with three
states, as shown in Table 1.
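The encoding can be written out as a small function. This is a behavioural sketch only, with a hypothetical function name and phase labels; it models the (0,0) reset during precharge and the (1,0) / (0,1) codes during evaluation, not the transistor-level circuit.

```python
def triggered_1bit(signal_a, phase):
    """Return the triggered 1-bit decoder signal (a, a') for a static signal A.

    phase is "precharge" or "evaluate" (labels assumed for this sketch).
    """
    if phase == "precharge":
        return (0, 0)                          # reset state, independent of A
    if phase == "evaluate":
        return (1, 0) if signal_a else (0, 1)  # "one" -> (1,0), "zero" -> (0,1)
    raise ValueError(f"unknown phase: {phase!r}")

# The three states used by the encoding.
ENCODING_STATES = {
    triggered_1bit(a, phase)
    for a in (0, 1)
    for phase in ("precharge", "evaluate")
}
```

Only three of the four possible (a, a') combinations occur; (1,1) is never produced by this encoding.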

Figure 2 shows a triggered 1-bit decoder for dynamic CMOS logic circuits. It consists
essentially of a general signal a and its complementary signal a'. During the precharge
time (φ = 1), n-devices (2,6) are on, so that the outputs (a and a') are precharged to low,
and p-devices (1,5) are off, so the output signals are low; p-device (3) and n-device (4)
together run as an inverter. During the evaluate time (φ = 0), when n-devices (2,6) are off
and p-devices (1,5) are on, the values at the drains of p-devices (1,5) are transferred to the
outputs a and a', respectively. Only the logic values at the drains of p-devices (1,5) can be
transferred to the outputs, because the outputs a and a' are initially low during the evaluate
time (φ = 0). In order to give the complementary signal a' and the general signal a of any
input signal the same rise time, the channel width of p-device 3 must be designed
to be larger than that of n-device 4. This decreases the delay-time of input signals at the
beginning of the evaluation time.

Figure 3 shows a triggered 2-bit decoder for input signals of a dynamic CMOS PLA. A
2-bit decoder decodes a 2-bit number into 4 output signals. In our case, during precharge
time the decoder output signals are set to all “one” or all “zero” depending on the SOP or
POS types, respectively. The n-device (14) and the p-device (6) together run as an AB gate.
This reduces power dissipation and improves speed in a dynamic CMOS PLA, because
the number of input nodes is reduced by half, leading to a reduction of the capacitance
of the input nodes. If the node $N_1$ is high, the node $N_2$ is low, the node of the drain of
p-device (4) is low, and the node NA of the gate of n-device (4) is low during precharge
time, then the node $N_1$ becomes low with threshold voltage and slow speed. However,
because the node $N_2$ is low during the precharge time, the transfer of logic 1 only occurs
through p-devices (4,5) to the node $N_2$ with high speed during the evaluation time. During
the precharge time the n-devices (13,15,17,19) work as the ground switches in the front
array part of PLAs, because they set all input signals of the front array part to “zero”. In
the case of the SOP PLA (see next section), all input signals in the front array part are set to
“one” during the precharge time. In this way, these triggered decoder circuits allow the
input of static CMOS signals in a dynamic CMOS PLA.

2.2  Design  of  a Single  Phase  Dynamic  CMOS  SOP  PLA 

The schematic of a single phase dynamic CMOS NOT-NOR-NOR-NOT sum of products
(SOP) PLA using triggered 1-bit decoders is shown in Figure 4. This SOP PLA consists
of triggered 1-bit decoders, the AND array, buffers, and the OR array.

The triggered 1-bit decoder consists of inverters (such as 1), the p-device loads (such
as 4,6) as the ground switches in the AND (front part) array, and functional n-device
switches (such as 3,5). The n-device (3) acts as the switch for a general signal and the
n-device (5) acts as the switch for a complementary signal.

The AND array consists of loads (such as p-device 11,12), switches (such as n-device
15,16) as the ground switches in the OR (next part) array, and functional switches (such
as n-device 19,20) with no ground switch. A conventional dynamic CMOS logic system in
two-valued logic using two-variable two-valued logic must use ground switches to prevent
a discharge path during precharge time. However, by using triggered 1-bit decoder logic,
ground switches are not needed. Omitting the ground switches reduces power dissipation
and improves speed. Furthermore, the triggered 1-bit decoders set
all input signals in the AND array to high and all input signals in the OR array to low
during precharge time. In this way, the triggered decoder concept is suitable for a dynamic
CMOS PLA system.

Charge sharing is usually a problem in the design of dynamic CMOS AND gates.
However, the proposed NOR gates which use charge sharing techniques are suitable for
the implementation of the AND array. This charge sharing technique in the AND array
overcomes the difficulty of cascading single phase dynamic CMOS gates without the ground
switches. All inputs are assumed stable before the evaluation time. During precharge
time, when the loads (such as 11,12) are precharged, the input load nodes (such as $N_1$) are
charged to high, and all of the minterm nodes (such as $N_2$) in the AND array are discharged to
low because all triggered 1-bit decoder signals are high. When the clock goes high for the
evaluation, all loads of input minterms are turned off and the minterm switches (such as
n-devices 15,16) are turned on. Evaluation paths will exist through the AND array input
devices according to the state of the inputs. During evaluation time, the output of the AND
array will conditionally charge to high only if all inputs in the minterm of the AND array
are high. This keeps all minterm lines at virtual ground except the selected ones, which
have all input transistors connected to them turned off, with all existing charges remaining
shared. These AND gates work as NOT-NOR gates. Also, a minimum capacitance ratio
$C_{N_1}/C_{N_2}$ of 2:1 is required, where $C_{N_1}$ is the capacitance of the node $N_1$ and $C_{N_2}$
is the capacitance of the node $N_2$.
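The effect of the capacitance ratio can be illustrated with a short calculation. The sketch assumes ideal charge conservation between a load node precharged to Vdd and a fully discharged minterm node, with no leakage or coupling; the function name and the example capacitance values are hypothetical. Under that assumption the shared node settles at $C_{N_1} V_{dd}/(C_{N_1} + C_{N_2})$.

```python
def shared_voltage(c_load, c_minterm, vdd=5.0):
    """Steady-state voltage after ideal charge sharing.

    c_load    : capacitance of the precharged load node (any consistent unit)
    c_minterm : capacitance of the initially discharged minterm node (same unit)
    vdd       : supply voltage in volts
    """
    if c_load <= 0 or c_minterm <= 0:
        raise ValueError("capacitances must be positive")
    return vdd * c_load / (c_load + c_minterm)

# A 2:1 ratio leaves two thirds of Vdd on the shared node; a 1:1 ratio leaves half.
V_RATIO_2_TO_1 = shared_voltage(2.0, 1.0)
V_RATIO_1_TO_1 = shared_voltage(1.0, 1.0)
```

With Vdd = 5 V, the 2:1 ratio gives about 3.33 V on the selected node, whereas a 1:1 ratio would give only 2.5 V, which suggests why a minimum ratio is needed for the following buffer to read the node as high.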

The speed of the AND array depends on the minterm switches (such as 15,16), and
their maximum drain current depends on the n-devices (such as 19,20) and the n-devices
(such as 15,16). Thus, within the range of the maximum drain current (600 μA), the width
of the n-devices (15,19) can be increased slightly to 6.5 μm, compared with 4 μm in
a conventional CMOS PLA, to improve speed. Figure 5 shows the simulated waveforms of internal
nodes. If A and B are low, $V_4$ is the φ signal as the input trigger voltage, $V_{12}$ is the voltage of
the NOR gate using charge sharing at the selected node $N_3$, and $V_{13}$ is the pumping voltage
at the unselected node $N_2$.

Figure 6 shows the simulated waveforms of selected internal nodes. $V_2$ is the selected
output voltage at O', $V_4$ is the φ signal as the input trigger voltage, $V_{22}$ is the voltage at the
selected AND array node $N_3$, whose steady state voltage is $C_{N_1} V_{dd}/(C_{N_1} + C_{N_2})$,
and $V_{16}$ is the voltage at the selected OR array node $N_5$ if A and B are low.

Figure 7 shows the relationship between the pumping voltage and the width ratio. This
PLA was simulated using SPICE3d1 with the channel width of the n-device 15 ($W_{15}$)
at 6.5 μm, and that of the n-device 19 ($W_{19}$) at 6.5 μm. Although speed increases
linearly with the channel width ratio w, a larger ratio increases the pumping voltage and
thereby decreases the noise margin. Thus, in order to prevent incorrect operation resulting
from the pumping phenomenon (i.e., a reduction of the noise margin) at the nodes (such as
$N_2$) during the evaluation time, and to keep the virtual ground, an optimal channel width
ratio $W_{15}/W_{19}$ of 1:1 can be used. Therefore, to minimize the pumping voltage, the
optimal width ratio w is one; in the worst case, only one input of the multi-input
AND gate is high.

Figure 8 shows a dynamic CMOS buffer using a one-way NOR gate. The width of
the n-device (1) must be designed to be smaller (5 μm) than that of the ground switch
(15 μm) to prevent discharge by the pumping voltage. In the buffers, the first NOR buffers
(such as 35,36) should be designed so that the logic threshold value of the NOR buffer is
lower than Vdd/2. This measure improves the speed. These buffers can be used to
improve speed because the node $N_2$ has a larger capacitance due to many input variables.
The rise time of an input minterm depends predominantly on the resistance of the
n-device (15). Some delay will be incurred due to the finite pull-up time. The inverters
(43,44) are used for synchronizing the load signal with the triggered decoder signals in
the input minterms. This AND array using the charge sharing technique does not require
input tracking lines in the SOP PLA.

The  OR  array  consists  of  loads  (such  as  p-device  31,32),  inverters  (such  as  33,34),  and 
functional  switches  (such  as  n-device  27,28)  with  no  ground  switch.  Charge  is  dissipated 
only  in  the  selected  output  lines  themselves.  The  power  dissipation  in  the  OR  array  is 
minor  compared  to  the  AND  array. 

By  using  triggered  2-bit  decoders  on  the  input  during  the  precharge  time,  a number 
of  input  minterms  of  a PLA  can  be  minimized  to  reduce  power  and  to  improve  speed. 
Therefore,  it  is  possible  to  make  a faster  PLA  with  no  ground  switch.  This  SOP  PLA  is 
suitable  for  the  implementation  of  the  dynamic  CMOS  PLA  which  has  a lower  number  of 
the  AND  array  minterms  and  a greater  number  of  the  OR  array  minterms. 

3 Simulation  Results  and  Conclusions 

To compare the performance of the single phase dynamic CMOS PLAs, each of the
PLAs described in previous sections was simulated using SPICE3d1. The simulated
waveforms of the various single phase dynamic CMOS PLAs are shown in Figures 9 and 10. The
input waveforms are V(35) and V(4), the output waveforms of the front array are
V(26) and V(16), and the output waveforms are V(3) and V(2) in Figure 9 and Figure 10,
respectively. In the simulation, a maximum drain current of 600 μA in the input functional
n-device is used, the gate length is 2.0 μm, the $V_{TO}$ of the n-device is 0.71 V, the $V_{TO}$ of the
p-device is 0.80 V, the drain capacitance $C_D$ in the n-device is assumed to be 11.43 fF, the gate
capacitance $C_G$ in the n-device is assumed to be 2.9 fF, Vdd is 5 V, the number of nodes of an input
minterm is 130 with a capacitance of 1.5 pF, and the number of nodes of an output minterm is
130 with a capacitance of 1.5 pF.

Table 2 shows a comparison of simulation results for various single phase CMOS PLA
types, where both the input minterm and the output minterm are assumed to have the
same capacitance, the number of input minterms is m and the number of output minterms
is n. $T_f$ is the delay-time of the front array of a PLA and $P_f$ is the normalized average
power of the front array of a PLA. $T_b$ is the delay-time of the back array of a PLA and $P_b$
is the normalized average power of the back array of a PLA. $T_t$ is the total delay-time of
a PLA and $P_t$ is the total normalized average power of a PLA. The worst case total delay
time of a conventional single phase dynamic CMOS PLA is 35.5 ns. The SOP PLA using
the triggered 1-bit decoder and the 2-SOP PLA using the triggered 2-bit decoder are 2
and 2.23 times faster, respectively, than the conventional CMOS PLA.
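The quoted speed-up factors can be reproduced from the delay-times given in the abstract. The short calculation below is only a consistency check; assigning 18.05 ns to the SOP PLA and 15.95 ns to the 2-SOP PLA is an assumption, made because it is the assignment that matches the factors of about 2 and 2.23.

```python
CONVENTIONAL_NS = 35.5   # worst case total delay of the conventional PLA

# Assumed assignment of the two delay-times from the abstract.
PROPOSED_NS = {
    "SOP (triggered 1-bit decoder)": 18.05,
    "2-SOP (triggered 2-bit decoder)": 15.95,
}

# Speed-up = conventional delay / proposed delay.
SPEEDUP = {name: CONVENTIONAL_NS / delay for name, delay in PROPOSED_NS.items()}

for name, factor in SPEEDUP.items():
    print(f"{name}: {factor:.2f} times faster")
```

The ratios come out at roughly 1.97 and 2.23, in line with the "2 and 2.23 times faster" stated in the text.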

The normalized average power can be considered as the total charge in a minterm. The
front array in the conventional PLA has an average number of selected minterms of $\frac{m}{2} + 1$,
where the 1 is the input tracking line. The back array has an average number of selected
minterms of $\frac{n}{2}$. The “5” is the charge of a minterm and the “4” is the charge of a ground
switch. The proposed AND array using the charge sharing technique in the SOP PLA has
average numbers of selected minterms of $\frac{m}{2}$ and $\frac{n}{2}$, respectively. The selected minterm
has wasted j normalized charge and the unselected minterm has wasted 10 normalized
charge. The charge of the front array in the 2-SOP PLA is half of that in the SOP
PLA because of the use of the triggered 2-bit decoders. Thus the proposed PLA structures are
faster  and  require  lower  power  dissipation  than  the  conventional  single  phase  dynamic 
CMOS  NOT-NOR-NOR-NOT  PLA,  because  of  the  elimination  of  the  ground  switch.  For 
a typical  example  of  PLA  like  the  Signetics  82S100  with  16  inputs,  48  input  minterms  (m) 
and  8 output  minterms  (n),  the  2-SOP  PLA  using  the  triggered  2-bit  decoder  is  2.23  times 
faster  and  has  2.1  times  less  power  dissipation  than  the  conventional  PLA.  Therefore,  the 
proposed  2-SOP  PLA  using  the  triggered  2-bit  decoder  is  a faster  dynamic  CMOS  PLA, 
and  this  PLA  has  no  input  tracking  line. 
Figure 1: Conventional single phase (NOT-NOR)-(NOT-NOT)-(NOR-NOT) PLA


Time        1-bit signal    triggered 1-bit decoder signal
            A               a  a'      a  a'
precharge   *               0  0       1  1
evaluate    0               0  1       1  0
            1               1  0       0  1


Table 1: Encoding a 1-bit signal into a triggered 1-bit decoder signal using a two-variable
two-valued signal in a dynamic CMOS logic system


Figure  2:  CMOS  implementation  of  a triggered  1-bit  decoder  for  dynamic  CMOS  logic 
Figure 3: Static CMOS implementation of a triggered 2-bit decoder


Figure  4:  Single  phase  dynamic  CMOS  (NOT-NOR)-(NOT-NOT)-(NOR-NOT)  PLA  in 
sum  of  products  (SOP)  using  triggered  1-bit  decoders 


Table  2:  Comparison  of  simulation  results  for  various  single  phase  CMOS  PLA  types 
[Waveform plot; traces v(13), v(4), v(12)]

Figure 6: Simulated waveforms of internal selected nodes in a single phase dynamic CMOS
SOP PLA using the triggered 1-bit decoder

Figure  7:  Simulated  pumping  voltage  versus  channel  width  ratio  w — 1 



Figure 8: Dynamic CMOS buffer using a one-way NOR gate
[Waveform plot; traces v(26), v(35), v(3)]

Figure 9: Simulated waveforms of a conventional single phase dynamic CMOS PLA. V35 is
the triggered voltage of φ, V3 is the output voltage at the node O', and V26 is the output
voltage of the AND array at the gate of p-device (14)


[Waveform plot; traces v(26), v(35), v(3)]

Figure 10: Simulated waveforms of a single phase dynamic CMOS SOP PLA using the
triggered 2-bit decoder. All voltage numbers are the same as for the SOP PLA using the
triggered 1-bit decoder.
Abstract - A new logic family, which is immune to single event upsets, is described.
Members of the logic family are capable of recovery, regardless of
the shape of the upsetting event. Glitch propagation from an upset node is
also blocked. Logic diagrams for an Inverter, Nor, Nand and Complex Gates
are provided. The logic family can be implemented in a standard, commercial
CMOS process with no additional masks. DC, transient, static power, upset
recovery and layout characteristics of the new family, based on a commercial
1 μm CMOS N-Well process, are described.

1 Introduction 

Historically,  the  emphasis  on  Single  Event  Upset  (SEU)  research  has  been  devoted  to 
memory circuits [1]-[16]. Memory circuits perform vital functions in any digital system,
as  program  stores,  temporary  registers  and  as  elements  of  state  machines  which  control 
digital  circuits.  An  SEU,  or  soft  error,  caused  by  a charged  particle  striking  a diffusion 
region  in  a memory  element  can  prove  catastrophic  to  an  electro-mechanical  system  which 
relies  upon  that  memory  element  for  communication  or  control.  Great  effort  has  been  made 
to  find  memory  structures  which  are  immune  to  SEUs,  or  at  least  mitigate  the  effects  of 
an  upsetting  event.  The  design  of  SEU  immune  memories,  whether  RAM  or  Flip-Flops, 
has  tended  to  ignore  system  level  problems,  such  as  an  SEU  of  a combinational  logic  gate 
which  is  sampled  by  a memory  circuit,  or  an  upset  of  a control  signal  such  as  a clock 
line or mux select. It has been shown [17,18] that transients propagated out of or into
memory elements are indeed a real problem. Research to find general logic gate structures
which are SEU immune has been primarily limited to resistive or capacitive hardening,
which are basically low pass filtering approaches [17,19,20]. Kang and Chu [21] present
a logic/circuit  design  approach  but  the  CMOS  inverter  buffers  are  susceptible  to  particle 
hits  on  the  p-type  diffusion.  The  pre-charged  output  node  is  susceptible  to  a particle  strike 
on  the  n-type  diffusion  if  the  pulldown  chain  does  not  evaluate  low.  More  recently  [16] 
and  [22]  have  presented  memory  cells  based  on  logic/circuit  design  techniques.  Only  [22] 
addresses  the  issue  of  glitch  propagation. 

This  paper  presents  a complete  logic  family  which  is  SEU  immune.  Members  of  the 
family  are  constructed,  using  logic/circuit  design  techniques,  to  recover  from  an  SEU, 
regardless  of  the  shape  of  the  upsetting  event.  It  is  also  shown  that  the  logic  family  can 
prevent  glitch  propagation  from  an  upset  node.  The  logic  family  can  be  implemented  in 
a standard,  commercial  CMOS  process  without  any  additional  processing  steps.  The  DC, 
transient, static power, upset recovery and layout characteristics of the new family, based
on a commercial 1 μm CMOS N-Well process, are presented in this paper.

Section  2 provides  circuit  configurations  of  members  of  the  logic  family,  including  an 
Inverter,  2-input  Nand,  2-input  Nor,  3-input  OrNand  and  a 3-input  AndNor.  In  addition, 
a description  of  the  SEU  recovery  mechanism  is  presented  and  a means  for  extending 
this  mechanism  to  logic  structures  in  general  is  provided.  DC  characteristics  of  the  SEU 
immune  inverter  are  described  in  Section  3.  Noise  margins,  gain  characteristics  and  the 
effects  of  device  ratioing  and  threshold  voltages  are  discussed.  Section  4 provides  simulation 
results  which  show  that  the  SEU  recovery  mechanism  is  independent  of  the  duration  or 
shape  of  the  upsetting  event.  Blocking  of  glitch  propagation  is  also  presented.  Section 
5 presents  circuit  switching  speed  results  based  on  pair-delay  simulations.  The  effects 
of  device  ratioing  on  switching  speed  are  also  discussed.  Static  power  considerations  are 
presented  in  Section  6 and  physical  layout  issues  are  presented  in  Section  7.  Section  8 
provides  a summary  and  conclusions. 


2 An  SEU  Immune  Logic  Family 

The  literature  related  to  SEU  immune  combinational  logic  is  sparse  and  has  provided  few 
clues  as  to  what  would  be  necessary  to  design  a logic  family  which  provides  immunity  to 
single  event  upsets.  Whitaker  has,  however,  provided  a concise  summary  of  fundamental 
concepts which can be used in the design of SEU immune memory circuits [22]. First,
information must be stored in two different places. This provides a redundancy and maintains
a source of uncorrupted data after an SEU. Second, feedback from the noncorrupted
location  of  stored  data  must  cause  the  lost  data  to  recover  after  a particle  strike.  Finally, 
current  induced  by  a particle  hit  flows  from  the  n-type  diffusion  to  the  p-type  diffusion. 
If  a single  type  of  transistor  is  used  to  create  a memory  cell  then  p-transistors  storing  a 1 
cannot  be  upset  and  n-transistors  storing  a 0 cannot  be  upset.  An  understanding  of  these 
three  concepts  and  close  examination  of  the  memory  circuit  presented  in  [22]  has  provided 
the  key  to  the  design  of  an  SEU  immune  logic  family. 

Figure  1 is  a transistor  level  logic  diagram  of  an  SEU  immune  inverter.  The  inverter 
consists of two transistor networks, a p-channel network and an n-channel network. All
devices are enhancement mode transistors. The inverter is a two-input/two-output logic
device with Pin driving only p-channel devices and Nin driving only n-channel devices.
Node Pout can provide a source of 1's which cannot be upset and node Nout provides a
source of 0's which cannot be upset. Transistor M2 is sized to be weak compared to M1
and transistor M3 is sized to be weak compared to M4. The SEU recovery mechanism
works as follows. When the inputs to the inverter are 0, Pout and Nout are at a 1. In
this state only Nout can be corrupted by an upset. If Nout is hit, driving the node to a 0,
transistor M2 will turn on but cannot overdrive M1. Pout will remain at a 1, so transistor M3
will remain on, pulling Nout back up to a 1. Conversely, if Pin and Nin are 1, Pout and Nout
will be at 0 and only Pout can be upset. If Pout is hit, driving the node to a 1, transistor
M3 will turn on but, being weak compared to M4, cannot pull Nout up; Nout will remain pulled down to a 0.
Figure 1: SEU Immune inverter.


The  inverter  follows  the  fundamental  principles  for  SEU  immunity  and  is  therefore  made 
SEU  immune. 
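
The recovery argument above can be illustrated with a small switch-level sketch. This is a toy model and not the SPICE analysis used in the paper. The device connections are an assumption inferred from the prose (M1 pulls Pout up when Pin is 0, M2 pulls Pout down when Nout is 0, M3 pulls Nout up when Pout is 1, M4 pulls Nout down when Nin is 1), and the function names and the two-level strength encoding are hypothetical.

```python
STRONG, WEAK = 2, 1  # relative drive strengths: M1, M4 strong; M2, M3 weak

def step(pin, nin, pout, nout):
    """One relaxation step of the toy model; returns the next (pout, nout)."""
    # Pout: M1 pulls up when Pin = 0 (strong); M2 pulls down when Nout = 0 (weak).
    up = STRONG if pin == 0 else 0
    down = WEAK if nout == 0 else 0
    next_pout = pout if up == down else int(up > down)  # undriven node holds its value
    # Nout: M3 pulls up when Pout = 1 (weak); M4 pulls down when Nin = 1 (strong).
    up = WEAK if pout == 1 else 0
    down = STRONG if nin == 1 else 0
    next_nout = nout if up == down else int(up > down)
    return next_pout, next_nout

def settle(pin, nin, pout, nout, max_steps=10):
    """Iterate from a (possibly upset) node state until it stops changing."""
    for _ in range(max_steps):
        nxt = step(pin, nin, pout, nout)
        if nxt == (pout, nout):
            break
        pout, nout = nxt
    return pout, nout

# Inputs at 0, Nout knocked to 0 by a hit: both outputs return to 1.
print(settle(0, 0, pout=1, nout=0))
# Inputs at 1, Pout knocked to 1 by a hit: both outputs return to 0.
print(settle(1, 1, pout=1, nout=0))
```

In this model the stronger driver always wins a contention, which is the role the device ratioing plays in the real circuit; threshold drops and timing are ignored.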

It  is  readily  apparent  that  the  inverter  design  concepts  can  be  applied  to  any  logic  gate 
to provide SEU immunity. Figures 2, 3, 4 and 5 are the transistor level logic diagrams of a
two-input NAND, two-input NOR, three-input OrNand and three-input AndNor respectively.
In general, an SEU immune logic gate, implemented with this technique, requires
2n + 2 transistors, n being the number of gate inputs. In comparison, classical CMOS
requires 2n transistors to implement a gate.
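
The transistor counts quoted above can be tabulated for the gates named in the text. This is a minimal sketch; the function names are illustrative only.

```python
def classical_cmos_transistors(n_inputs: int) -> int:
    """Classical CMOS gate: 2n transistors for n inputs."""
    return 2 * n_inputs

def seu_immune_transistors(n_inputs: int) -> int:
    """SEU immune gate built with this technique: 2n + 2 transistors."""
    return 2 * n_inputs + 2

for gate, n in [("Inverter", 1), ("2-input NAND/NOR", 2), ("3-input OrNand/AndNor", 3)]:
    print(f"{gate}: classical {classical_cmos_transistors(n)}, "
          f"SEU immune {seu_immune_transistors(n)}")
```

The overhead is a constant two transistors per gate, so its relative cost shrinks as the number of inputs grows.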

The  logic  family  described  here  can  provide  transient  suppression  of  an  upset  event  as 
well  as  recovery  from  the  upset.  Networks  of  logic  gates  are  connected  such  that  Pout  only 
drives  p-channel  devices  and  Nout  only  drives  n-channel  devices.  If  Pout  is  upset,  driving  the 
node  to  a 1,  the  p-transistor  being  driven  will  be  turned  off  momentarily  without  affecting 
the  output  of  the  following  stage.  If  Nout  is  upset,  driving  the  node  to  a 0,  the  n-transistor 
being  driven  will  be  turned  off  momentarily  without  affecting  the  output  of  the  following 
stage. 

The above description overlooks some of the circuit design issues which would
be faced by someone wishing to design with this logic family. The family, although implemented
in a CMOS process, is ratioed logic, with ratioing occurring between transistors
M1 and M2 and between transistors M3 and M4. This logic family, therefore, bears a closer
resemblance to NMOS than it does to CMOS. Additionally, threshold voltages become a
design issue because of the enhancement mode transistors being used to pull up Nout and
to pull down Pout. Design implementation issues related to ratioing and threshold voltages
are presented in the following sections.


3 Inverter  DC  Characteristics 

The DC transfer function of an inverter provides several useful pieces of information
about a logic family. Noise margin, inverter gain and inverter switch points are all
characteristics which can be determined from a plot of Vout versus Vin. A DC transfer
function plot can also show if hysteresis is present. The SPICE [23] circuit simulator was
NAME        PARAMETER SET       VOLTAGE RANGE   TEMP.
WCLVHT      SLOW N   SLOW P     4.1V            140°C
WCHVHT      SLOW N   SLOW P     5.5V            140°C
WCLVLT      SLOW N   SLOW P     4.1V            -55°C
WCHVLT      SLOW N   SLOW P     5.5V            -55°C
BCLVHT      FAST N   FAST P     4.1V            140°C
BCHVHT      FAST N   FAST P     5.5V            140°C
BCLVLT      FAST N   FAST P     4.1V            -55°C
BCHVLT      FAST N   FAST P     5.5V            -55°C
FNSPLVHT    FAST N   SLOW P     4.1V            140°C
FNSPHVHT    FAST N   SLOW P     5.5V            140°C
FNSPLVLT    FAST N   SLOW P     4.1V            -55°C
FNSPHVLT    FAST N   SLOW P     5.5V            -55°C
SNFPLVHT    SLOW N   FAST P     4.1V            140°C
SNFPHVHT    SLOW N   FAST P     5.5V            140°C
SNFPLVLT    SLOW N   FAST P     4.1V            -55°C
SNFPHVLT    SLOW N   FAST P     5.5V            -55°C

Table  1:  DC  Transfer  Function  Simulation  Cases 


used  to  generate  DC  transfer  functions  for  the  SEU  immune  inverter  described  in  Section 
2.  Results  of  these  simulations  will  be  presented  here. 

In a classical family of logic, such as NMOS, PMOS or CMOS, a transistor $\beta$ is defined
to be the product of the process gain factor, $K'$, and the transistor aspect ratio, $W/L$. That
is, $\beta_{tran} = K' \frac{W}{L}$. The inverter $\beta$ is defined as the ratio of the pullup $\beta$ and the pulldown
$\beta$, or $\beta_{INV} = \beta_{pullup}/\beta_{pulldown}$. The logic family described in Section 2 is a ratioed logic family. In
this case the ratioing occurs between the same type devices, and the $K'$ term cancels in
the ratio. Therefore, $\beta_{tran} = \frac{W}{L}$. In this case it is more useful to define transistors as
strong (M1, M4) and weak (M2, M3), instead of the traditional pullup and pulldown. To
complicate matters further, $\beta_{INV}$ now has two components, $\beta_N$ and $\beta_P$, which are not
necessarily equal. For the simulations presented here, $\beta_{INV} = \beta_N = \beta_P = \beta_{STRONG}/\beta_{WEAK}$.
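
The cancellation of the process gain factor can be sketched numerically. This is a minimal illustration, assuming the inverter ratio is taken as the strong device's W/L over the weak device's W/L; the function names and the example dimensions are hypothetical and are not taken from the paper.

```python
def beta_tran(k_prime: float, width: float, length: float) -> float:
    """Classical transistor beta: process gain factor times aspect ratio."""
    return k_prime * width / length

def beta_inv(w_strong: float, l_strong: float, w_weak: float, l_weak: float) -> float:
    """Strong-to-weak ratio for same-type devices; K' cancels, leaving aspect ratios."""
    return (w_strong / l_strong) / (w_weak / l_weak)

# Example with made-up dimensions (micrometres): an 8.0/1.0 strong device
# against a 2.0/1.0 weak device gives a ratio of 4.
ratio = beta_inv(8.0, 1.0, 2.0, 1.0)
print(ratio)

# The same ratio results from full betas with any common K' value.
k = 3.0e-5
print(beta_tran(k, 8.0, 1.0) / beta_tran(k, 2.0, 1.0))
```

Because both devices in each ratio are of the same type, the ratio depends only on drawn geometry, which is why the text can drop $K'$.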

As weak is a relative term and it was unknown what effect ratioing would have on
DC characteristics, simulations were run over 16 process parameter/voltage/temperature
cases on 15 values of $\beta_{INV}$ ranging from | to . Table 1 lists the 16 simulation cases. It
was necessary to run these 16 cases in order to determine what effect processing variations
would have on the SEU immune inverter. The temperature and voltage ranges cover those
required by military specifications of integrated circuits.
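
The 16 case names in Table 1 follow a regular pattern, which can be sketched as a Cartesian product. The decoding of the name fragments (WC, BC, FNSP, SNFP, LV/HV, HT/LT) is inferred from the table rather than stated in the text, and the dictionary layout is illustrative.

```python
from itertools import product

# Name fragments as inferred from Table 1 (an assumption, not defined in the text).
PROCESS = {
    "WC": ("SLOW", "SLOW"),    # slow N, slow P
    "BC": ("FAST", "FAST"),    # fast N, fast P
    "FNSP": ("FAST", "SLOW"),  # fast N, slow P
    "SNFP": ("SLOW", "FAST"),  # slow N, fast P
}
VOLTAGE = {"LV": 4.1, "HV": 5.5}        # supply voltage in volts
TEMPERATURE = {"HT": 140, "LT": -55}    # degrees Celsius

def simulation_cases():
    """Build the case table keyed by name, e.g. 'WCLVHT'."""
    cases = {}
    for proc, volt, temp in product(PROCESS, VOLTAGE, TEMPERATURE):
        n_speed, p_speed = PROCESS[proc]
        cases[proc + volt + temp] = {
            "n": n_speed, "p": p_speed,
            "vdd": VOLTAGE[volt], "temp_c": TEMPERATURE[temp],
        }
    return cases

for name, settings in simulation_cases().items():
    print(name, settings)
```

Four process corners, two supply voltages and two temperatures give the 4 x 2 x 2 = 16 cases.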

Once the DC simulations were performed, an inverter gain and noise margin analysis
was undertaken. It is known that ratioed logic, particularly when threshold voltage effects
are involved, has lower noise margins than non-ratioed CMOS logic. Ratioing will also
affect the gain of a logic gate. If the gain is too low a signal will die out after only a few
logic stages. In the case of the SEU immune inverter, under the WCLVLT case, gains of 1 or
NOISE MARGIN LOW
    LOW     0.20V    $\beta_{INV}$ = 7    BCHVHT
    HIGH    0.74V    $\beta_{INV}$ = \    FNSPHVLT
NOISE MARGIN HIGH
    LOW     0.30V    SNFPLVHT
    HIGH    1.07V    WCLVLT
INVERTER GAIN VARIATIONS
    LOW     1.6
    HIGH    11.3

Table  2:  Noise  margin  and  inverter  gain  variations 

less were attained for $\beta_{INV}$ of | and . Additionally, negative noise margins were attained
for $\beta_{INV}$ of | and 1. These $\beta$s are of course unusable in a design. Both noise margin low
(immunity from positive spikes) and noise margin high (immunity from negative spikes)
were analyzed for Pout and Nout. Table 2 provides a summary of this analysis.

The inverter DC simulations eliminated 5 $\beta_{INV}$ values from further consideration and showed
that several more could prove marginal in a design. With the DC analysis complete, the
SEU recovery ability of the inverter could be investigated. The results of this investigation
are presented in Section 4.

4 SEU  Recovery  Results 

To  verify  the  SEU  recovery  ability  and  the  transient  suppression  characteristics  of  the  SEU 
immune  inverter,  described  in  Section  2,  SPICE  simulations  were  run  over  the  same  16 
cases described in Section 3. Both Pout and Nout were tested. Since inverters with $\beta_{INV}$ < f
were rejected during DC analysis, only 10 $\beta_{INV}$ values, ranging from - to | , were simulated at this
stage. The SEU immunity of the logic family was shown to be independent of processing
parameters, temperature or supply voltage. The error recovery mechanism is provided by
the logical feedback of transistors M2 and M3 and the ratioing of transistor strengths. The
recovery  mechanism  is  also  not  dependent  upon  the  wave  shape  of  the  current  pulse  which 
upsets  the  node.  

The simulation circuit used to test the recovery mechanism consisted of a chain of 3
identical inverters. No parasitic capacitance other than self-capacitance and that seen at
the inputs to the next stage was added to the circuit. The inputs to the first inverter
were set up to the proper initial conditions. A voltage controlled current source was
connected to the node to be upset. This provided a means to inject charge into the node
without attaching any parasitic capacitance. Additionally an ideal diode, emulating the
parameter dependent source/drain to substrate/well diodes, was attached to the node.
This diode did not create any additional capacitance. A current pulse, with a duration of
10 ns, and a magnitude sufficient to forward bias the source/drain diode, was applied to
the node. The 10 ns pulse width was chosen because it was longer than the propagation
delay through the inverter as well as being longer than a real SEU. Recovery from an SEU
was shown to be independent of parameter/voltage/temperature cases. Although all $\beta_{INV}$
cases recovered, SEU recovery time was dependent upon $\beta_{INV}$. Faster recovery times were
noted for $\beta_{INV}$ >

Besides  being  able  to  recover  from  an  upset  event  an  SEU  immune  logic  family  must 
be  able  to  suppress  the  propagation  of  transients  out  of  the  upset  node.  Due  to  the  P-net 
driving  P-net  and  N-net  driving  N-net  configuration  described  in  Section  2,  the  logic  family 
presented  in  this  paper  should  be  able  to  suppress  glitches  caused  by  an  SEU.  SPICE 
simulations  verified  that  this  is  the  case.  The  simulation  circuit  used  to  test  transient 
suppression  was  the  same  as  that  used  for  testing  upset  recovery.  In  this  case,  however, 
a 1 ns current pulse was applied to the upset node. This pulse duration is closer to what
one  would  expect  from  a real  SEU.  Transient  suppression  was  measured  at  the  output  of 
the  inverter  being  driven  by  the  upset  node.  If  the  magnitude  of  the  glitch  on  this  output 
was within the noise margin, for the parameter/voltage/temperature case and $\beta_{INV}$ being
simulated, the transient was considered suppressed. Results of these simulations indicated
that transient suppression was dependent upon simulation cases as well as $\beta_{INV}$. In fact,
any $\beta_{INV} \le 1$ was rejected as unusable in a design, due to poor transient suppression
abilities.

The seven ratios with $\beta_{INV} > 1$ remaining after the SEU recovery/transient suppression
simulations  were  subjected  to  a transient  analysis  to  determine  switching  speeds  of  the  SEU 
immune  logic  family.  These  results  are  presented  in  Section  5. 


5 Transient  Analysis  of  the  SEU  Immune  Inverter 

With a modern CMOS process it is possible to attain inverter gate delays of 1 ns or less. For
an SEU immune logic family to be of interest to the VLSI design community the inverter
described in Section 2 should have a gate delay at least in the ns range. Transient analysis
simulations  show  that  this  is  possible.  SPICE  simulations  were  run  over  the  same  16  cases 
described  in  Sections  3 and  4.  The  simulation  circuit  was  a chain  of  7 identical  inverters. 
Each inverter was loaded with a 1000 pF linear capacitor. This large capacitor swamped
out  any  voltage  dependent  capacitors  associated  with  transistor  source/drain  regions  as 
well  as  gate  capacitances  seen  by  the  inverter  outputs.  The  first  inverter  in  the  chain  was 
excited  by  a step  function,  and  pair  delay  information  was  extracted  from  the  output.  A 
pair  delay  is  defined  to  be  the  delay,  measured  from  mid-point  to  mid-point  of  the  voltage 
swing,  through  a pair  of  inverters.  This  delay  contains  both  a time  delay  rise  and  a time 
delay  fall.  In  non-ratioed  logic,  such  as  classical  CMOS,  inverters  are  designed  to  have 
equal rise and fall times. In a ratioed logic family it is not always possible to design for
equal rise and fall times, therefore pair delay information is more useful. In this case 4
pair delay values were computed, delay from a rising edge and from a falling edge, for
both Pout and Nout. The longest delay of these was chosen as the worst case delay. At
the  outset  it  was  unknown  which  parameter/voltage/temperature  case  would  prove  to  be 
that  of  worst  case  speed.  In  classical  CMOS  it  would  be  WCLVHT.  For  this  logic  family 
Pair delay chart (Cload = 1000 pF, Delay = μs)

Table 3: Pair delay results.


it also proved to be WCLVHT. Simulations were run on all of the surviving $\beta_{INV}$s, with
ten different transistor widths, ranging from 2.4 μm to 24.0 μm. Pair delay charts for each
$\beta_{INV}$ were constructed. A table of pair delay versus transistor width is provided in Table
3. As expected, because delay is inversely proportional to width, pair delays decrease as a
function of transistor width. Speed, another useful design measure, is the linear function
$\frac{1}{delay}$. From the results of the SEU recovery ability, described in Section 4, and the pair
delay information in this section, it would seem that $\beta_{INV} = \infty$ would be the best choice.
However, as in all engineering endeavors there is a practical limit to the choice of $\beta_{INV}$.
Both power dissipation and physical layout constraints must be considered. Section 6 and
Section 7 will discuss these issues, as they relate to the SEU immune inverter.
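
The pair delay measurement described in this section (mid-point to mid-point of the voltage swing through a pair of inverters) can be sketched on sampled waveforms as follows. The function names, the linear interpolation between samples, and the example data are assumptions for illustration; the paper's values came from SPICE.

```python
def crossing_time(times, volts, level):
    """First time the sampled waveform crosses `level`, by linear interpolation."""
    samples = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 != v1 and (v0 - level) * (v1 - level) <= 0:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")

def pair_delay(times, v_in, v_out, v_low, v_high):
    """Delay between mid-swing crossings at the input and output of an inverter pair."""
    mid = (v_low + v_high) / 2.0
    return crossing_time(times, v_out, mid) - crossing_time(times, v_in, mid)

def worst_case(delays):
    """The longest of the measured pair delays is taken as the worst case."""
    return max(delays)

# Made-up example: the input crosses mid-swing at t = 0.5, the output at t = 2.5.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
v_in = [0.0, 5.0, 5.0, 5.0, 5.0]
v_out = [0.0, 0.0, 0.0, 5.0, 5.0]
print(pair_delay(t, v_in, v_out, 0.0, 5.0))
```

In the procedure described above, four such values (rising and falling edges, for Pout and Nout) would be passed to `worst_case`, and speed would be taken as the reciprocal of the result.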

6 Static  Power 

In  Section  2 it  was  stated  that  the  SEU  logic  family  presented  in  this  paper  was,  in  some 
regards,  more  closely  related  to  NMOS  than  CMOS.  Due  to  the  ratioing  between  the 
normal  transistors  and  the  feedback  transistors,  and  the  effects  of  threshold  voltages,  this 
logic  family  dissipates  static  power.  SPICE  simulations  were  run,  with  the  same  cases 
described  in  previous  sections,  to  characterize  this  power  dissipation,  and  the  effects  of 
$\beta_{INV}$ on it. As expected, power dissipation increased with $\beta_{INV}$. The power dissipation
was  worst  under  BCHVLT  conditions  for  both  input  high  and  input  low  conditions.  Static 
power  consumption  may  place  a limit  on  the  number  of  SEU  immune  gates  which  can  be 
placed  on  an  integrated  circuit. 

7 Physical  Layout 

The  SEU  immune  logic  family  presented  in  this  paper  can  be  implemented  in  a standard 
CMOS process, using standard layout design rules. The family does, however, have characteristics
which make physical layout of the family different than a classical CMOS layout.
A classical inverter, for example, requires a minimum of two lines, the input and the output,
crossing the well boundary. The SEU immune inverter has two separate inputs, Pin
and Nin, but they need not cross the well boundary. However, there are two feedback lines
which must cross. Additionally, both VDD and VSS are required for both n-transistors
and p-transistors, whereas a classical inverter only requires one power supply for each
transistor type. The signal connections are more complicated in the SEU immune logic
family than in classical CMOS. In addition, the SEU immune logic family has two more
transistors than does classical CMOS. One should, therefore, expect that layout densities
would be less for the SEU immune logic family. As designers acquire more experience with
layout considerations the attained densities should improve.


8 Summary  and  Conclusions 

This  paper  presented  a complete  logic  family  which  is  SEU  immune.  Members  of  the 
family  are  constructed,  using  logic/circuit  design  techniques,  to  recover  from  an  SEU, 
regardless  of  the  shape  of  the  upsetting  event.  It  was  also  shown  that  the  logic  family  can 
prevent  glitch  propagation  from  an  upset  node.  The  logic  family  can  be  implemented  in 
a standard,  commercial  CMOS  process  without  any  additional  processing  steps.  The  DC, 
transient,  static  power,  upset  recovery  and  layout  characteristics  of  the  new  family,  based 
on a commercial 1 μm CMOS N-Well process, were presented.

This logic family makes the design of completely SEU immune integrated circuits possible.
The simulation results presented in this paper should prove useful to designers who
need  to  implement  SEU  immune  systems. 

A test  chip,  which  will  be  used  to  verify  the  simulations  presented  here,  is  currently 
being  defined. 
Figure 3: SEU Immune two-input NOR.

Figure 4: SEU Immune three-input OrNand.

Cellular Logic Array for
Computation  of  Squares  1 

M.  Shamanna,  S.  Whitaker  and  J.  Canaris 
NASA  Space  Engineering  Research  Center 
for  VLSI  System  Design 
University  of  Idaho 
Moscow,  Idaho  83843 

Abstract-  A cellular  logic  array  is  described  for  squaring  binary  numbers.  This 
array  offers  a significant  increase  in  speed,  with  a relatively  small  hardware 
overhead.  This  improvement  is  a result  of  novel  implementation  of  the  formula 
$(x + y)^2 = x^2 + y^2 + 2xy$. These results can also be incorporated in the existing
arrays  achieving  considerable  hardware  reduction. 

1 Introduction 

The  advent  of  VLSI  has  spurred  a renewed  interest  in  the  development  of  specialized 
arithmetic  circuits.  Special  arithmetic  functions  like  squares  and  square-roots  are  generally 
implemented  in  software.  However,  when  a machine  is  designed  for  a specific  application, 
wherein  squaring  is  a frequent  process,  it  may  prove  advantageous  in  terms  of  speed  to  use  a 
hardware  implementation.  Most  of  the  approaches,  reported  in  literature  for  squaring  and 
square-rooting,  use  array  multipliers  or  special  purpose  arrays  which  perform  a multitude 
of  other  operations  in  addition  to  squaring.  As  a result,  there  are  very  few  arrays  which  are 
solely  devoted  to  extraction  of  squares.  However,  Dean[l]  has  reported  such  a dedicated 
array which is probably among the fastest squaring circuits known thus far. In
addition, Dean’s array uses considerably less hardware than other arrays reported so far.
Hence Dean’s array has been selected as the obvious choice for comparison with the array
proposed in this paper. The proposed array will provide a significant gain in speed, with
a very small hardware overhead, as compared to Dean’s squarer [1].


2 Algorithm 

Dean[l]  has  not  presented  a formal  algorithm  for  his  implementation.  So,  the  widely 
used  general  binary  squaring  algorithm[3]  will  be  presented  first  followed  by  the  proposed 
algorithm  for  purposes  of  clarity  and  easy  understanding.  The  existing  algorithm  for  binary 
squaring  is  generally  formulated  as  follows: 

$$(1)^2 = (01)_b$$

$$(a_1 1)^2 = (a_1)^2 + (0 a_1 0 1)_b$$ or

1 This research was supported (or partially supported) by NASA under Space Engineering Research
Center Grant NAGW-1406.
$$F_2 = F_1 + (0 a_1 0 1)_b,$$

where $F_1 = (01)_b$ if $a_1 = 1$ and $F_1 = (00)_b$ otherwise. Similarly, we have

$$(a_2 a_1 1)^2 = (a_2 a_1)^2 + (0 0 a_2 a_1 0 1)_b,$$ or
$$F_3 = F_2 + (0 0 a_2 a_1 0 1)_b$$

In general, if $a_{r+1} = 1$ then

$$F_{r+1} = F_r + D_r$$

where $F_r = (a_r a_{r-1} \ldots a_2 a_1)^2$ is the rth square and $D_r = \underbrace{0 0 \ldots 0}_{r\ \text{times}} a_r a_{r-1} \ldots a_2 a_1 0 1$ is called
the rth radicand. It is obvious that $F_{r+1} = F_r$ if $a_{r+1} = 0$. The above iterative formula
applies for all $r = 1, 2, \ldots, n$. Figures 4 and 5 show the schematic details of a three bit
squaring array for the above algorithm [3].
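
The iterative scheme above can be sketched in software by consuming the bits one at a time, most significant first. The sketch makes explicit an alignment step that the formulas leave implicit: before a radicand is added, the running square is shifted left by two positions, since appending a bit b to a prefix X gives (2X + b)^2 = 4X^2 + b(4X + 1). The function name and the list-of-bits interface are illustrative assumptions.

```python
def square_msb_first(bits):
    """Square a binary number given as a list of 0/1 bits, most significant first."""
    prefix = 0   # value of the bits consumed so far
    square = 0   # running square of `prefix` (F_r in the text)
    for bit in bits:
        square <<= 2                      # align the running square with the new bit
        if bit:
            square += (prefix << 2) | 1   # radicand: the prefix bits followed by 0 1
        prefix = (prefix << 1) | bit
    return square

print(square_msb_first([1, 0, 1]))  # 5 squared
print(square_msb_first([1, 1, 1]))  # 7 squared
```

When the incoming bit is 0 nothing is added, which corresponds to the remark that the square is unchanged in that case apart from alignment.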

The proposed algorithm makes use of the well known formula $(x + y)^2 = x^2 + y^2 + 2xy$.
Consider a three bit number $(a_2 2^2 + a_1 2^1 + a_0 2^0)$. The LSB-1 and LSB of the square of
any number will respectively be 0 and the LSB of the original number itself. Therefore,

$$(a_2 2^2 + a_1 2^1 + a_0 2^0)^2 = (a_2 + a_2 a_1) 2^4 + (a_2 a_0) 2^3 + (a_1 a_0 + a_1) 2^2 + a_0.$$

The same result can also be achieved by the repeated application of the formula
$(x + y)^2 = x^2 + y^2 + 2xy$ where y is the LSB and x is the rest of the binary number.

$$
\begin{aligned}
(\underbrace{a_2 2^1}_{x} + \underbrace{a_1 2^0}_{y})^2 &= \underbrace{(a_2 2^1)^2}_{x^2} + \underbrace{2(a_2 a_1 2^1)}_{2xy} + \underbrace{a_1 2^0}_{y^2} \\
&= (a_2) 2^2 + (a_2 a_1) 2^2 + a_1 2^0 \\
&= (a_2 + a_2 a_1) 2^2 + a_1 2^0 \qquad (1)
\end{aligned}
$$

Also,

$$
\begin{aligned}
(\underbrace{a_2 2^2 + a_1 2^1}_{x} + \underbrace{a_0 2^0}_{y})^2 &= \underbrace{(a_2 2^2 + a_1 2^1)^2}_{x^2} + \underbrace{2(a_2 a_0 2^2 + a_1 a_0 2^1)}_{2xy} + \underbrace{a_0 2^0}_{y^2} \\
&= (a_2 2^1 + a_1 2^0)^2 2^2 + (a_2 a_0 2^3 + a_1 a_0 2^2) + a_0 2^0 \qquad (2)
\end{aligned}
$$

Equation 1 proves that the LSB-1 bit and the LSB of the final answer are always 0
and the LSB of the original number itself respectively. Since multiplication by 2 implies
a left-shift by one bit position the term $(2 a_2 a_1)$ has been shifted from the $2^1$ bit position
to the $2^2$ bit position in Equation 1. This result for a three bit binary number is realized by
the array of Figure 1. The algorithm can easily be extended to any n bit number. The
novelty of the algorithm lies in the fact that squaring of the number is carried out in
steps coupled with the ingenious use of left-shifts in the bit positions.
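
The repeated splitting into "LSB and the rest" can be checked with a short software sketch. It is an arithmetic illustration only, not a model of the cellular array, and the function names are hypothetical. With n = 2x + y and y a single bit, (2x + y)^2 = 4x^2 + 4xy + y, because y^2 = y. The second function evaluates the three-bit expansion with the term (a1 a0 + a1) at weight 2^2, as Equations 1 and 2 combine to give.

```python
def square_by_lsb_split(n: int) -> int:
    """Square n by repeatedly splitting off the LSB: n = 2x + y."""
    if n == 0:
        return 0
    x, y = n >> 1, n & 1
    # x^2 is shifted left by two; 2xy becomes a left shift of x*y; y^2 = y.
    return (square_by_lsb_split(x) << 2) + ((x * y) << 2) + y

def three_bit_square(a2: int, a1: int, a0: int) -> int:
    """Three-bit expansion: (a2 + a2a1)2^4 + (a2a0)2^3 + (a1a0 + a1)2^2 + a0."""
    return ((a2 + a2 * a1) << 4) + ((a2 * a0) << 3) + ((a1 * a0 + a1) << 2) + a0

for n in range(8):
    a2, a1, a0 = (n >> 2) & 1, (n >> 1) & 1, n & 1
    print(n, square_by_lsb_split(n), three_bit_square(a2, a1, a0))
```

Both columns agree with n squared for every three-bit input, and bit 1 of the result is always 0 while bit 0 equals a0.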
3 Comparison

The implementation of the proposed algorithm for a 3 bit and a 4 bit number is illustrated in Figures 1 and 3 respectively. The proposed array is built of the basic half-adder cell shown in Figure 2. Its function may be defined as follows:

$$u = (w + v_{-1}) \oplus (xy)$$

$$v = (w + v_{-1}) \cdot (xy)$$

The symbols + and · stand for the Inclusive-OR and AND operations in the above expressions.
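The cell function above can be tabulated directly. This is a small sketch under the reading that u is the sum output and v the carry output of a half-adder whose two operands are $(w + v_{-1})$ and $(xy)$; the name `basic_cell` and the argument name `v_prev` (standing for $v_{-1}$) are assumptions introduced for illustration.

```python
from itertools import product


def basic_cell(w, v_prev, x, y):
    """Return (u, v) for one cell; all arguments are single bits (0 or 1)."""
    a = w | v_prev      # Inclusive-OR of w and v_{-1}
    b = x & y           # AND of x and y
    u = a ^ b           # half-adder sum
    v = a & b           # half-adder carry
    return u, v


if __name__ == "__main__":
    print("w v-1 x y | u v")
    for w, v_prev, x, y in product((0, 1), repeat=4):
        u, v = basic_cell(w, v_prev, x, y)
        print(w, v_prev, " ", x, y, "|", u, v)
```

Under that reading, u + 2v always equals the arithmetic sum of the two one-bit operands.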

The implementation of a 3 bit squarer based on Dean's algorithm is also illustrated in Figures 6 and 7. The basic cell (Figure 7) has two control inputs A and B. The inputs on the lines C and D are added in the cell, S being the sum out and P being the carry out. When both A and B are present, a further digit is added to the sum (and carry), so that the cell then functions as a full-adder [1].

It  can  be  seen  that  the  proposed  array  has  1 + * whereas  Dean’s  array  [1]  uses 

1 + YZ=i  * cells  resulting  in  a overhead  of  (n  - 2)  cells.  However,  the  hardware  inside  the 
proposed basic cell is much simpler, as it utilizes only half-adders, compared to full-adders in Dean's array. So the increase in the number of cells is offset by the reduction in the complexity of the individual cell. This leads to the authors' contention that the hardware overhead, which translates into increased chip area, is almost negligible. Moreover, the propagation time through the proposed array is only $n\tau$, as compared to $(2n - 3)\tau$, which is the delay through Dean's array. The hardware overhead-speed gain relation follows the square law for most specialized arithmetic arrays. Here, an increase in speed has been accomplished with a linear increase in hardware.

The proposed array has a number of unused inputs which can be used to add in another number so that the array would function as a full squarer (all outputs in 1 state). A specialized array of this sort has a number of applications, including the generation of binary logarithms [2], which depends on iterative squaring.


4 Conclusions 

A new cellular array for extraction of squares of binary numbers has been presented. A squaring algorithm based on the formula $(x + y)^2$ has been described. The proposed array provides impressive speed gains compared to the existing arrays at the expense of negligible hardware overhead. It is hoped that the algorithm discussed in this paper will provide fresh insights to reduce the redundant hardware present in most of the existing squaring arrays.
Figure 1: Proposed squaring array for three bit numbers

Figure 2: Basic cell used in the proposed squaring array

Figure 3: Proposed squaring array for four bit numbers


Figure  4:  A three  bit  squaring  array  using  the  general  algorithm 
Figure 7: Basic cell used in Dean's array
Fault Tolerant Sequential Circuits Using Sequence Invariant State Machines

for VLSI System Design
University  of  Idaho 
Moscow,  Idaho  83843 

Abstract  - The  idea  of  introducing  redundancy  to  improve  the  reliability  of 
digital  systems  originates  from  papers  published  in  the  1950s.  Since  then, 
redundancy  has  been  recognized  as  a realistic  means  for  constructing  reliable 
systems.  This  paper  will  introduce  a method  using  redundancy  to  reconfigure 
the  Sequence  Invariant  State  Machine  (SISM)  to  achieve  fault  tolerance.  This 
new  architecture  is  most  useful  in  space  applications,  where  recovery  rather 
than  replacement  of  faulty  modules  is  the  only  means  of  maintenance. 


1 Introduction 

Fault tolerance is an essential feature for digital systems where reliability, availability and safety are of vital importance. Such systems include aerospace missions, where a recovery procedure must be employed as a means of maintenance, rather than replacement procedures, which would be impossible during such missions.

Most  digital  systems  can  be  divided  into  two  functional  blocks:  the  controller  and 
the  data  path.  The  controller  is  a sequential  circuit  that  performs  certain  tasks  based 
on external and internal information. A programmable hardware architecture has been developed that enables a controller's hardware to be designed without knowledge of the exact sequence of the input data to be incorporated [1]. This programmable architecture is called a Sequence Invariant State Machine (SISM).

This  paper  will  introduce  a method  to  achieve  fault  tolerance  in  the  SISM  design 
using  dynamic  redundancy.  With  this  method,  faulty  controllers  can  recover  and  resume 
operation.  Two  different  architectures  are  proposed  and  analyzed  in  terms  of  transistor 
count,  size  and  fault  detection.  One  architecture  is  clearly  superior  to  the  other. 

2 SISM  Overview 

With the SISM realization, any flow table can be implemented without a change in the hardware configuration. That is, given the number of states m and the number of inputs n, a hardware circuit is easily derived that can implement any sequence of states.
Figure 1: General SISM Architecture.
Table 1 shows a general 6-state, 3-input flow table. The state assignment for this table is shown in Table 2. Figure 1 shows the SISM architecture for one of the next state variables in Table 2. There are two identical architectures for the remaining two variables. Only the destination state codes are different. The figure consists of the following components.

• The destination state codes, which are derived from the next state entries in the state assignment table by inspection. For example, the destination state codes for state B and state variable $y_i$ are the next state bits $Y_i$ associated with state B. Therefore, the destination state codes for state B are (000, 110, 101) under input states $(I_1, I_2, I_3)$ and variables $(y_1, y_2, y_3)$ respectively.

• The  input  switch  matrix  which  is  combinational  logic  that  produces  all  the  possible 
next  state  entries  for  each  current  input  state. 

• The  next  state  logic  which  consists  of  an  independent  path  for  each  of  the  present 
states  in  the  state  assignment  flow  table. 

• The  storage  element,  a D-FF,  that  preserves  the  present  state. 

The  current  input  state  selects  the  set  of  potential  next  states  that  the  circuit  can 
assume  (input  column  in  the  flow  table).  The  present  state  variables  select  the  exact  next 
state  (row  in  the  flow  table)  that  the  circuit  will  assume  at  the  next  clock  pulse. 
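The column-then-row selection just described can be modelled behaviourally. The sketch below is only an illustration: the flow table entries are hypothetical (they are not the contents of Table 1), and the names `FLOW_TABLE`, `next_state` and `run` are assumptions introduced here.

```python
# Hypothetical 6-state, 3-input flow table. The entries are illustrative
# only and are NOT taken from Table 1 of the paper.
FLOW_TABLE = {
    "A": {"I1": "B", "I2": "A", "I3": "C"},
    "B": {"I1": "D", "I2": "C", "I3": "A"},
    "C": {"I1": "E", "I2": "F", "I3": "B"},
    "D": {"I1": "A", "I2": "D", "I3": "F"},
    "E": {"I1": "C", "I2": "B", "I3": "E"},
    "F": {"I1": "F", "I2": "E", "I3": "D"},
}


def next_state(present, inp, table=FLOW_TABLE):
    """Model one clock pulse of the state machine."""
    # The input switch matrix: the current input state selects one column,
    # i.e. the set of potential next states.
    column = {state: row[inp] for state, row in table.items()}
    # The next state logic: the present state selects the row in that column.
    return column[present]


def run(start, inputs, table=FLOW_TABLE):
    """Return the list of states visited, beginning with `start`."""
    trace = [start]
    for inp in inputs:
        trace.append(next_state(trace[-1], inp, table))
    return trace


if __name__ == "__main__":
    print(run("A", ["I1", "I2", "I3", "I1"]))
```

Changing the machine's behaviour only requires changing the table data, which mirrors the point that the hardware configuration stays fixed while the destination state codes vary.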

3 SISM  Implementation 

Two  pass  transistor  networks  which  make  the  SISM  fault  tolerant  will  next  be  discussed 
and  compared  in  terms  of  space  and  the  number  of  transistors.  The  input  switch  matrix 
is  shown  in  both  structures  as  a logic  block,  since  it  is  identical  in  both  designs. 

3.1  FCS  Design 

A Fully Coded Structure (FCS) [4] network is defined as a fully specified pass network circuit. Knowledge of the number of next state variables is sufficient to achieve this design. Thus, the FCS is a design by inspection. Using Table 2 as a reference, three state variables are required to implement this table. Therefore, eight unique states can be represented. Each state will have an independent branch with all the variables as control terms. Those branches are all connected to the output pass function. Only one branch is activated by any combination of control variables at a given time, since each branch is encoded uniquely. The output pass function is the logical OR, or the summation, of all states. Figure 2 shows the complete FCS structure for the next state variable $Y_1$ in Table 2. The other two variables have identical structure, but different destination state codes.
3.2 BTS Design

A Binary  Tree  Structure  (BTS)  [3]  network  is  defined  as  a pass  network  in  which  exactly 
two  branches  join  at  every  node  and  the  control  term  of  one  branch  is  the  complement  of  the 
control  term  of  the  other  branch.  Generally,  each  control  term  is  a single  control  variable 
and  the  number  of  nodes  exceeds  one.  A BTS  network  is  constructed  by  partitioning 
each  next  state  variable  in  a specific  manner  until  all  the  variables  have  been  partitioned. 
Therefore,  a BTS  network  is  also  designed  by  inspection. 

Consider the flow table shown in Table 1. Three variables are needed to implement this flow table. The procedure is general and can be applied to any state machine. Firstly, start with the output node and partition the variable $Y_3$ into two branches. One of the branches will have $Y_3$ as the control variable and the other branch will have its complement $\bar{Y}_3$ as the control variable. Secondly, for each node at the end of each of the newly constructed branches, construct two more branches for the control variable $Y_2$ and its complement. Thirdly, for each node at the end of each new branch, construct two branches for the variable $Y_1$ and its complement. With this step the design structure is completed. Figure 3 shows the complete BTS structure for the next state variable $Y_1$. The other two next state variables are identical in structure and only the destination state codes are different.

3.3  Comparison 

The  BTS  and  FCS  structures  both  use  pass  transistor  networks.  The  number  of  transistors 
in  the  BTS  structure  is  less  than  the  number  of  transistors  in  the  FCS  structure,  since 
the  BTS  structure  is  partitioned  around  each  control  variable.  In  terms  of  space  and  size, 
the  BTS  would  appear  to  require  less  space.  However,  using  the  SISM  compiler  developed 
by  Buehler  [2]  to  design  the  BTS  structure,  the  space  required  for  each  design  is  basically 
the  same.  The  extra  space  available  in  the  BTS  structure  is  difficult  to  utilize.  Using  the 
SISM  compiler,  a custom  drawn  SISM  layout  for  one  of  the  variables  in  Table  2,  using 
both  structures,  is  shown  in  Figures  4 and  5. 
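The difference in transistor count can be estimated from the two structural descriptions. The sketch below is a rough count under stated assumptions that are not taken from the paper: one pass transistor per control term per branch, and no allowance for the input switch matrix, storage element or buffers. The function names are hypothetical.

```python
def fcs_pass_transistors(n_vars):
    """FCS: one branch per state code, each gated by all n control variables."""
    return (2 ** n_vars) * n_vars


def bts_pass_transistors(n_vars):
    """BTS: a binary tree with 2 branches at the output node, 4 at the next
    level, and so on; each branch is gated by a single control variable."""
    return sum(2 ** level for level in range(1, n_vars + 1))


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, fcs_pass_transistors(n), bts_pass_transistors(n))
```

For the three-variable example this estimate gives 24 pass transistors for the FCS and 14 for the BTS per next state variable, in line with the statement that the BTS uses fewer transistors; the layout area, as noted above, need not follow the same ratio.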


3.4  Destination  State  Codes  Implementation 

The  destination  state  codes  are  all  the  inputs  that  must  be  fed  to  either  the  BTS  or  the 
FCS  structure  in  order  to  implement  a state  table.  The  inputs  can  be  driven  in  several 
ways.  They  could  be  directly  connected  to  VDD/VSS  or  they  could  be  driven  by  the 
output  of  a shift  register.  The  input  array  could  also  be  constructed  as  a programmable 
memory  such  as  EPROM. 

In order to achieve programmability in the SISM structure, the data must not be hard-wired. If data were implemented using VDD and VSS connections, then the programmable nature of this design is limited to single mask programmability. Using a shift register will achieve the programmability objectives. The shift register will, however, increase the size of the circuit. If the EPROM is implemented on the IC, the size of the controller will also increase, but since an EPROM cell is considerably smaller than a D-FF, the size impact is much less than that for the shift register approach.

Figure 5: SISM layout using the FCS structure.

Table 3: Fully specified flow table.


4 Achieving  Fault  Tolerance 


In incorporating fault tolerance in any digital system, two approaches can be considered. The first approach is called static redundancy, also known as fault masking, which uses extra components such that the effect of a faulty component is masked instantaneously. The second approach is called dynamic redundancy, which has extra components but only one component operates at a time. If a fault is detected in the operating module, it is switched out and replaced by a spare. Dynamic redundancy requires consecutive actions of fault detection and fault recovery [5].

The idea of dynamic redundancy to achieve fault tolerance can be applied to the SISM structure. Here, the operating module refers to all the paths (states) in the next state selection logic that construct the state machine, and the spare parts refer to the unutilized logic (redundant states) in the architecture. Therefore, if a fault has been detected in a given state (i.e. the path that identifies that state), a spare path is switched in to replace the current path and correct operation is resumed.

Most state machines do not utilize all available states. Therefore, some of those states can be thought of as spare states and are redundant. To optimize the versatility and robustness of a controller, the redundant states can be used to replace any state which exhibits a malfunction. By applying a method for reconfigurability, the redundant states can be used to improve the reliability and to enhance the performance of an IC.

With reference to Table 1, there are six states; therefore three variables are needed to implement this flow table. With three variables, a maximum of eight states are available. Six of these states are used and two states are redundant. The next state entries for each of the two redundant states have been assigned the initial value (which is a safe output in all cases) as shown in Table 3, with the assumption that state A is the initial state. If state B tested faulty, then one of the redundant states, such as state G, could be used to replace state B to achieve correct operation.

Both the BTS and the FCS will have extra logic, and the reconfigurability method can be applied to use the extra logic. However, the location of a fault in the BTS can limit the use of the redundant logic and therefore decrease fault tolerance. That is, if a fault affects any of the transistors controlling $Y_1$ or its complement in Figure 3, then the method is valid and redundant logic can be used to replace that faulty branch. However, if a fault affects $Y_3$ or its complement, then there is not enough redundant logic to replace the entire faulty section. Therefore, the redundant logic has limited capabilities in the BTS structure. An identical structure can be added, but in doing so static redundancy can be achieved easily, at the cost of increasing the structure size by a factor of two.

The FCS structure is better suited to this method. If any stuck-at (s-a) or stuck-open (s-op) fault occurs at the input or in the structure, then only one path (state) is affected. However, if a stuck-on (s-on) fault occurs in the structure, then at most two paths (states) will be affected. For example, if a stuck-at fault affects state B, then only the path that represents state B is affected and can be replaced. However, if a stuck-on fault affects state B, then two paths will be enabled at the same time. In that case too, the redundant logic can be used to replace the malfunctioning state. Hence the FCS structure is more applicable if dynamic redundancy is to be used.

Furthermore, the redundant logic in the FCS structure does not mask any of the faults that could occur in the structure, because the redundant logic does not replicate any of the existing states. Therefore, a fault in the structure or even in the redundant logic itself is testable.


5 Design  Procedure 

If any path in the FCS architecture becomes faulty due to an input being stuck at 1 or stuck at 0, a stuck open or shorted pass transistor, or any other malfunction, then the entire path is no longer correct and therefore must be replaced or recovered. To achieve fault tolerance, three steps must be carried out: error detection, fault location, and then replacement and recovery. The primary concern here is with the replacement and recovery technique. Once the designer has concluded that an error has occurred in a part of the IC, fault detection and location techniques are applied to detect and locate the faulty part. If the faulty part is in the controller section of the circuit, then it must be determined where the fault has occurred and the kind of fault that occurred.

Referring  to  Table  3,  assume  that  the  fault  diagnosis  has  shown  that  state  B is  a faulty 
state.  This  corresponds  to  the  path  (Y^TjjTi)  in  Figure  2,  then  the  following  steps  are 
applied. 

STEP1 

Examine  the  flow  table  at  hand  and  determine  which  of  the  redundant  states  will  be  used 
to  replace  state  B.  Since  this  flow  table  has  two  redundant  states,  State  G is  chosen.  State 
H could  have  just  as  validly  been  chosen,  but  for  simplicity  the  next  state  in  order  was 
chosen.  Hence  state  G,  (Y3;  Y2;  Y)  is  chosen  to  replace  state  B. 

STEP2 

Modify the flow table to reflect the new changes. That is, scan the flow table and replace each next state entry of B with the new state G. Therefore, everywhere in the next state entries of the state table, replace B with a G. Table 4 reflects this replacement process.

Table 4: Second step in the replacement procedure.

Table 5: Third step in the replacement procedure.

STEP3 

Fill the next state entries of state G with the same next state entries as those of state B. That is, the next state entries for G will be the same as the next state entries for B, provided that STEP2 was completed. Table 5 shows the result of this step.

STEP4 

The next state entries for state B are modified in a way that masks the kind of permanent fault in the hardware.

1. If a stuck-at, s-op or s-on fault occurs at the input of the destination state codes or in the input switch matrix, or an s-a-1 or s-a-0 fault occurs on the destination codes, then disabling the B state is sufficient.

2. If an s-op fault has occurred in any of the variables, then the path is already disabled.

3. If an s-on fault occurs in any of the variables, then the destination state codes to the faulty path must be identical to those of the new path the fault assumes. That is, if a variable in state B is stuck on, so that this path takes on the code of state F, then the next state entries of state B must be the same as those of state F. Hence, when state F is enabled, state B is also enabled. To achieve correct operation, both states must have the same next state entries. As a result, the fault is masked. Table 6 shows the resulting flow table.

Table 6: Modified flow table.

Table 7: Modified flow table.

STEP5 

The  new  state  assignment  is  then  reflected  in  the  modified  flow  table.  Table  7 shows  the 
state  assignment  and  the  next  state  entries  assignment. 

STEP6 

The  destination  state  codes  derived  from  the  modified  flow  table  determine  the  new  data 
entries  for  the  shift  register. 

With  the  completion  of  Step6,  the  operation  of  the  circuit  can  be  resumed  with  the 
same  expected  results. 
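STEP2 and STEP3 of the procedure amount to a simple rewrite of the flow table, which can be sketched as follows. The table below is hypothetical (it is not Table 3), the function name `replace_faulty_state` is an assumption, and STEP4, which depends on the kind of fault, is deliberately not modelled.

```python
# Hypothetical fully specified flow table: six used states plus two
# redundant states (G, H) whose entries return to the initial state A.
# The entries are illustrative only and are NOT those of Table 3.
FLOW_TABLE = {
    "A": {"I1": "B", "I2": "A", "I3": "C"},
    "B": {"I1": "D", "I2": "C", "I3": "A"},
    "C": {"I1": "E", "I2": "F", "I3": "B"},
    "D": {"I1": "A", "I2": "D", "I3": "F"},
    "E": {"I1": "C", "I2": "B", "I3": "E"},
    "F": {"I1": "F", "I2": "E", "I3": "D"},
    "G": {"I1": "A", "I2": "A", "I3": "A"},
    "H": {"I1": "A", "I2": "A", "I3": "A"},
}


def replace_faulty_state(flow_table, faulty, spare):
    """Return a new flow table in which `spare` takes over from `faulty`."""
    # STEP2: every next state entry that pointed at the faulty state
    # now points at the spare state.
    new_table = {
        state: {inp: (spare if nxt == faulty else nxt) for inp, nxt in row.items()}
        for state, row in flow_table.items()
    }
    # STEP3: the spare state inherits the next state entries of the faulty
    # state (taken after STEP2, so self-references are already redirected).
    new_table[spare] = dict(new_table[faulty])
    return new_table


if __name__ == "__main__":
    repaired = replace_faulty_state(FLOW_TABLE, faulty="B", spare="G")
    for state, row in repaired.items():
        print(state, row)
```

After the rewrite no next state entry leads to B, so the faulty path is never entered during normal operation; the destination state codes loaded into the shift register would then be derived from the rewritten table.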

Two  final  points  are  worth  discussing.  Firstly,  if  the  state  machine  does  utilize  all 
of  its  states  then  an  additional  state  variable  must  be  added  to  allow  this  procedure  to 
be  employed.  In  order  to  demonstrate  the  procedure,  the  flow  table  shown  in  Table  8 
is  considered.  As  can  be  seen  there  are  no  extra  states.  Therefore,  a new  state  variable 
is  added  and  then  the  state  assignment  is  revisited  during  the  initial  design  to  achieve 
redundancy.  The  next  state  equations  and  the  hardware  implementation  will  reflect  this 
modification.  The  modified  flow  table  is  shown  in  Table  9. 

Secondly, this method can be extended to achieve fault tolerance in the remaining parts of the circuit. This would be achieved by determining the faulty part and reconfiguring the state machine in such a way as not to enable the faulty part, and to activate another part to replace it.

Table 8: General 4-state, 2-input flow table.

Table 9: Modified flow table.
Abstract - This paper explores a VLSI architecture for geometrical mapping address computation. The geometric transformation is reviewed in terms of plane projective geometry, which yields a set of basic transformations to be implemented for general image processing. Homogeneous and 2-dimensional Cartesian coordinates are employed to represent the transformations, each of which is implemented via an augmented CORDIC as a processing element. A specific scheme for a processor, utilizing full pipelining at the macro-level, parallel constant-factor-redundant arithmetic and full pipelining at the micro-level, is assessed to produce a single chip VLSI for HDTV applications under the current state-of-the-art MOS technology.


1 Introduction 

Geometrical transformations are widely discussed in the field of digital image processing, in areas such as high-definition television (HDTV), image recognition, interactive computer graphics and vision processing [1,2,3]. The primary interest of these transformations is to project an image into a different domain, to extract additional signals conveying the information of the image. Moreover, they afford value-added images over conventional display via high resolution, definition, and flexible framing. Consequently, a geometrical mapping processor is expected to emerge to support real-time processing. In recent years, several geometrical mapping processing modules have been developed and applied successfully to specific applications. They are implemented either by a popular graphics package or application software accompanying an acceleration box [5], or by a VLSI processor [6]. We are interested in a VLSI implementation of a processor that achieves real-time speed for TV image processing, with a sufficient set of transformations to make a value-added display.

It has been known that two barriers have existed toward the development of such a processor. The first is the lack of a sufficiently high-speed arithmetic computation technique to generate the mathematical functions required for geometrical mapping. The second is the need for an extensive library of geometrical mapping functions. To overcome these, two key techniques have been developed in [4,6]: the first is a very high speed radix-2 signed-digit adder and the second is a pipelined micro-programmable arithmetic function generator. In this paper, we study the same problem with the goal of optimizing the overall functionality and performance. We achieve this goal by improving the basic cell.

In  the  following  section,  we  will  review  the  requirement  of  the  geometrical  mapping 
processor  by  introducing  its  definition  and  applications.  In  Section  3,  we  will  study  various 
CORDIC  schemes  to  implement  a basic  cell,  which  can  be  used  to  compose  the  necessary 
function  set  for  the  geometric  transformations. 


2 Geometrical  Mapper 


Transformation of a sub-image requires a mapping of the sub-image from its original location to the transformed location, pixel by pixel. To rearrange the image, it is necessary to calculate the destination address of each pixel; the unit that performs this calculation is called a geometrical mapper.

In the field of plane projective geometry, transformation from a point to another point is represented as a multiplication in homogeneous coordinates [10]. Let a 2-dimensional (2-D) point $p_x = (x, y)$ be represented as $(ax, ay, a)$ in right-handed homogeneous coordinates, with a non-zero constant $a$. The vector $p_x$ is referenced to an origin (0, 0). The most useful transformations are translation, rotation and scaling, examples of which are respectively defined as:

Trans(x, d) : translating $p_x$ to $(x + d, y)$
Rot(x, $\theta$) : rotating the vector $p_x$ by an angle of $\theta$ about x-axis
Scale(x, c) : scaling the vector $p_x$ by c along x-axis.

$$(x, y) \cdot \mathrm{Trans}(x, d) = (x + d,\; y)$$

$$(x, y) \cdot \mathrm{Rot}(x, \theta) = (x \cos\theta - y \sin\theta,\; x \sin\theta + y \cos\theta)$$

$$(x, y) \cdot \mathrm{Scale}(x, c) = (cx,\; y)$$


Or, the composite of 3 different transformations in 2-D is represented by

$$T = \begin{pmatrix} c \cos\theta & \sin\theta & 0 \\ -c \sin\theta & \cos\theta & 0 \\ cd & 0 & 1 \end{pmatrix}, \qquad (2)$$

which is called an affine transformation. The affine transformation is performed via a set of multiplications and trigonometric functions.
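The composite matrix can be checked numerically by multiplying the three elementary matrices in homogeneous coordinates. The sketch below assumes a row-vector convention, $(x, y, 1) \cdot M$, and the order rotation, then translation, then scaling; both are inferred from the form of Equation 2 rather than stated in the text, and all function names are introduced here for illustration.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]


def trans(d):
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [d, 0.0, 1.0]]


def scale(c):
    return [[c, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def composite(theta, d, c):
    """Rotation, then translation along x, then scaling along x (assumed order)."""
    return matmul(matmul(rot(theta), trans(d)), scale(c))


def apply(point, m):
    """Map a 2-D point through a 3x3 homogeneous matrix (row-vector form)."""
    x, y = point
    row = matmul([[x, y, 1.0]], m)[0]
    return row[0] / row[2], row[1] / row[2]


if __name__ == "__main__":
    for row in composite(theta=0.3, d=2.0, c=1.5):
        print(row)
```

With this ordering the product reproduces the entries of Equation 2, including the $cd$ term in the bottom-left position.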

As is easily observed, the affine transformation is a necessary transformation to map a sub-image into another area of the image domain, with sliding, re-sizing and proper rotation. Its immediate applications include sub-image generation for multiple picture-in-picture (PIP) and image template generation for recognition and vision/graphics processing.

A more sophisticated transformation useful for general image processing is the spherical transformation, which basically transforms between plane and sphere surfaces. A spherical transformation from $p_x$ to $q_x = (u, v)$ can be represented by using a set of elementary functions such as square root, division, and squaring operations.


$$u = \frac{rx}{\sqrt{r^2 - x^2 - y^2}}, \qquad v = \frac{ry}{\sqrt{r^2 - x^2 - y^2}}, \qquad (3)$$


where r denotes the degree of curvature of the sphere surface. A conventional way to implement the transformations starts from a software package, i.e., an interactive graphics package. To implement dedicated hardware, possibly a set of modular structures in VLSI, it is necessary to identify a basic cell for those functions, and there have been two different approaches: the first is based on a set of elementary function generators and the second on a programmable module. For the first approach, fast function generators are necessary and the performance is limited by the slowest function generator. The trigonometric functions are apparently the bottleneck when implemented via the first approach. To optimize the trigonometric function generation, while considering the regularity of its structure, CORDIC has been suggested. The recursiveness of the CORDIC iteration has led to the misleading notion that the second approach is not usually better than the first one.
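As a software reference for the spherical mapping of Equation 3, the destination address of one pixel can be computed with exactly the elementary operations named there (squaring, square root and division). This is a minimal floating-point sketch; the name `spherical_map` and the error handling for points outside the sphere are assumptions introduced here.

```python
import math


def spherical_map(x, y, r):
    """Return (u, v) for the plane point (x, y) and curvature parameter r."""
    radicand = r * r - x * x - y * y
    if radicand <= 0.0:
        # Equation 3 is only defined for x^2 + y^2 < r^2.
        raise ValueError("point lies on or outside the circle of radius r")
    root = math.sqrt(radicand)
    return r * x / root, r * y / root


if __name__ == "__main__":
    print(spherical_map(3.0, 4.0, 13.0))
```

Both outputs share the same square root, so a hardware realization would need to evaluate it only once per pixel.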

Recently, as VLSI technologies evolve, the effectiveness of integration is determined not simply by the complexity of multiplication but also by communication complexity, which includes regularity of the structure, simplicity of the design and localization of the interfacing. In these respects, CORDIC has been widely re-examined and shown to be appropriate for a number of algorithmic processors. In brief, CORDIC is a set of recursive algorithms which can be easily programmed to generate a set of elementary functions via different modes and proper zero-enforcing. It is also capable of vector-oriented processing.


3 CORDIC  Techniques 

In this section, we will review CORDIC functions to i) perform a vector transformation and ii) generate elementary functions. CORDIC comprises three linear recursive equations, namely the X-, Y- and Z-recurrences. Table 1 summarizes the computing mode, input and output specifications of the CORDIC functions of interest. As shown in the table, these functions are classified into two cases, one which enforces Z[N] to be zero (known as rotating) and the other which enforces Y[N] to be zero (known as vectoring). We will discuss these cases in the following sections.

3.1  Rotating  case 

The vector rotation for $p_x = (X[0], Y[0])$ by the angle $\theta$ can be realized by an iterative algorithm called CORDIC [12] instead of computing trigonometric functions and applying matrix multiplication. CORDIC realizes a vector rotation by a partial sum of micro-angle rotations with a pre-fixed sequence of angles. When the rotation macro-angle is represented as a sum of decomposed micro-angles, i.e. $\theta = \sum_{k=0}^{n-1} \theta_k$, the rotation becomes the product

$$\prod_{k=0}^{n-1} k_k \begin{pmatrix} 1 & -\tan\theta_k \\ \tan\theta_k & 1 \end{pmatrix}, \qquad (4)$$

where $k_k = \cos\theta_k$ is a micro-scale composing a final scale factor, explained later. Such a specific form of the pre-fixed micro-angle sequence as $\tan^{-1} 2^{-k}$ is attractive for VLSI implementation since it is composed only of additions, shifts, and an arctangent lookup table.

Mode         Input                          Enforcing    Output
Circular     $Z[0] = \theta$, (X[0], Y[0])  Z[N] = 0     Rotation by $\theta$
Circular     Z[0] = 0, (X[0], Y[0])         Y[N] = 0     $X[N] = \sqrt{X[0]^2 + Y[0]^2}$, $Z[N] = \tan^{-1}(Y[0]/X[0])$
Linear       Z[0] = 0, (X[0], Y[0])         Y[N] = 0     $Z[N] = Y[0]/X[0]$
Hyperbolic   (X[0], Y[0])                   Y[N] = 0     $X[N] = \sqrt{X[0]^2 - Y[0]^2}$

Table 1: Available CORDIC Processing

Non-redundant: The micro-iterations of the conventional (hereafter called non-redundant) CORDIC use the following 3 linear recursive equations [12]:

$$X[i+1] = X[i] + m \sigma_i 2^{-i} Y[i]$$

$$Y[i+1] = Y[i] - \sigma_i 2^{-i} X[i]$$

$$Z[i+1] = Z[i] - \sigma_i \tan^{-1} 2^{-i} \qquad (5)$$

where m is set to one for the circular CORDIC, to zero for the linear and to -1 for the hyperbolic. With an initial value of $Z[0] = \theta$, CORDIC rotates the initial values $X[0]$ and $Y[0]$ to the final values $X[n]$ and $Y[n]$ while driving $Z[i]$ closer to zero in each iteration i, so that $Z[n]$ is forced to be zero. With n iterations, n-bit accuracy of $X[N]$ and $Y[N]$ can be achieved. For a known angle, the direction of the rotation, $\sigma_i$, can be pre-computed or calculated one by one on-the-fly using the following selection function.


if  Z[i]  > 0 
if  Z[i]  < 0 


(6) 


The CORDIC rotation does not preserve the input norm. To obtain a rotated vector having
the same length as the input (X[0], Y[0]), X[n] and Y[n] need to be compensated by a scaling
factor K,

$$K = \frac{\| [X[n], Y[n]]^T \|}{\| [X[0], Y[0]]^T \|} = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2\, 2^{-2i}} \qquad (7)$$

where $\| \cdot \|$ stands for the norm of the vector. Note that K is constant for the non-redundant
scheme since $\sigma_i$ is in {-1, 1}.
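
The non-redundant recurrences (5) and selection function (6) can be sketched in floating-point
Python as follows. This is a minimal illustration rather than the hardware datapath: the names
`cordic_rotate` and `scale_factor` are hypothetical, the circular mode (m = 1) is assumed, and
multiplication by `2.0 ** -i` stands in for the hardware shift. With the signs as written in (5),
a positive angle turns the vector clockwise, and the result is longer than the input by the factor K.

```python
import math

def cordic_rotate(x, y, theta, n=24):
    """Circular (m = 1) non-redundant CORDIC rotation, per recurrences (5)-(6).

    Returns the unscaled (X[n], Y[n]) and the residual angle Z[n].
    Converges for |theta| up to about 1.74 rad (the sum of all micro-angles).
    """
    z = theta
    for i in range(n):
        sigma = 1 if z >= 0 else -1          # selection function (6)
        shift = 2.0 ** -i                    # a right shift by i bits in hardware
        x, y = x + sigma * shift * y, y - sigma * shift * x
        z -= sigma * math.atan(shift)        # arctangent lookup-table entry
    return x, y, z

def scale_factor(n=24):
    """Constant scale factor K when every sigma_i is in {-1, 1}."""
    return math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n))

xn, yn, zn = cordic_rotate(1.0, 0.0, 0.5)
K = scale_factor()
print(xn / K, yn / K, zn)   # approximately cos(0.5), -sin(0.5), 0
```

Dividing by K afterwards corresponds to the explicit compensation described above; because every
direction is +1 or -1, K depends only on the number of iterations and can be precomputed.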
Redundant: Non-redundant CORDIC is inherently slow, with a delay of $O(n^2)$ arising from
its recursiveness and serial dependency, since a micro-rotation with delay $O(n)$ must be
finished before the next micro-rotation is processed. The delay of a macro-rotation
(n micro-rotations) can be improved from $O(n^2)$ to $O(n)$ by using redundant arithmetic
(carry-free addition such as carry-save or signed-digit addition) and determining the direction
of the rotation $\sigma_i$ from an estimate instead of an exact value [14]. Redundant
arithmetic gives an addition delay of $O(1)$ instead of $O(n)$, and the direction must be
estimated so as not to erode this $O(1)$ advantage. This requires modifying the recurrences and
the selection function. The redundant CORDIC scheme produces its output about 4 times
faster than the non-redundant scheme [14]. However, it introduces additional cost, since allowing
$\sigma_i$ to be in {-1, 0, 1} makes the scale factor K vary with the macro-angle.

Constant-Factor-Redundant: To reduce the implementation cost of redundant CORDIC,
it is desirable to have a constant scale factor by forcing $\sigma_i$ to be in {-1, 1}. However,
since $\sigma_i$ is determined from an estimate, a question of convergence assurance arises. A scheme
that appends correcting iteration stages at proper positions was proposed to address it [15].
Following this idea, the number of extra correcting iterations is further reduced by dividing the
micro-iterations (for i = 0 to i = n - 1) into two groups: one where the direction of
the rotation is in {-1, 1}, for i = 0 to i = n/2, and the other where it is in {-1, 0, 1},
for i = (n + 1)/2 to i = n - 1. This reduces the correcting iterations by 50%, since a correcting
iteration is not needed for the second half of the micro-iterations, and we still obtain a
constant scale factor K, since the value of K in n-bit precision does not depend on the
$\sigma_i$ value for (n + 1)/2 <= i <= (n - 1). The Z-recurrence can also be modified so that
$\sigma_i$ is determined quickly by looking at a few most significant bits. This new scheme is
called Constant-Factor-Redundant CORDIC (CFR-CORDIC). The modified recurrences and selection
functions for the scheme are described below.

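$$\begin{aligned} X[i+1] &= X[i] + \sigma_i 2^{-i} Y[i] \\ Y[i+1] &= Y[i] - \sigma_i 2^{-i} X[i] \\ U[i+1] &= 2\,(U[i] - \sigma_i 2^{i} \tan^{-1} 2^{-i}) \end{aligned} \qquad (8)$$

where U[i], which is equal to $2^i Z[i]$, is introduced for implementation simplicity, and the
selection function is given as follows:

$$\sigma_i = \begin{cases} 1 & \text{if } U[i] > 0, \text{ or } U[i] = 0 \text{ and } i \le n/2 \\ 0 & \text{if } U[i] = 0 \text{ and } i > n/2 \\ -1 & \text{if } U[i] < 0 \end{cases} \qquad (9)$$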

When t fractional bits are used in the estimate, i.e., the estimate of U[i] is computed using t
fractional bits of the redundant representation of U[i], the following correcting iterations need
to be included, where the interval between the indexes of successive correcting iterations should
be less than or equal to (t - 1), up to the last iteration index equal to n/2. When the correction
stage is necessary at the jth step of micro-iteration,

$$U^c[j+1] = U[j+1] - 2\,\sigma_j^c\, 2^{j} \tan^{-1} 2^{-j} \qquad (10)$$

with the direction of the rotation $\sigma_j^c$ determined from the same selection function as in
eq. (9), except that it is decided based on U[j + 1] instead of U[i].
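
The scaled recurrence and the correcting stage both follow from substituting $U[i] = 2^i Z[i]$ into
the Z-recurrence of eq. (5). The short derivation below is a sketch under two assumptions: that the
correcting iteration repeats the micro-angle $\tan^{-1} 2^{-j}$ of step j, and that $\sigma_j^c$
(notation assumed here) denotes the direction chosen for that repeated step.

```latex
\begin{aligned}
U[i+1] &= 2^{i+1} Z[i+1]
        = 2^{i+1}\left(Z[i] - \sigma_i \tan^{-1} 2^{-i}\right)
        = 2\left(U[i] - \sigma_i\, 2^{i} \tan^{-1} 2^{-i}\right), \\
U^{c}[j+1] &= 2^{j+1}\left(Z[j+1] - \sigma_j^{c} \tan^{-1} 2^{-j}\right)
            = U[j+1] - 2\,\sigma_j^{c}\, 2^{j} \tan^{-1} 2^{-j}.
\end{aligned}
```

Since $2^{i} \tan^{-1} 2^{-i}$ approaches 1 as i grows, the constants subtracted from U[i] stay
of order one, which is what lets the sign be judged from a few most significant bits.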

3.2  Vectoring  case 

While the rotating case provides vector rotation to implement a geometrical mapper,
the vectoring case computes elementary functions, as listed in Table 1. The apparent difference
between the vectoring and rotating modes is the parameter that is forced to zero, which
necessitates a different selection function. For the conventional CORDIC, the recurrence
equations are given by

$$\begin{aligned} X[i+1] &= X[i] + \sigma_i 2^{-i} Y[i] \\ Y[i+1] &= Y[i] - \sigma_i 2^{-i} X[i] \\ Z[i+1] &= Z[i] + \sigma_i \tan^{-1} 2^{-i} \end{aligned} \qquad (11)$$

with the following selection function:

$$\sigma_i = \begin{cases} 1 & \text{if } Y[i] \ge 0 \\ -1 & \text{if } Y[i] < 0 \end{cases} \qquad (12)$$

The selection function developed for CFR-CORDIC in vectoring mode is shown below.
Let $W[i] = 2^i Y[i]$, by the same token as for the rotating case; then

$$\begin{aligned} X[i+1] &= X[i] + \sigma_i 2^{-2i} W[i] \\ W[i+1] &= 2\,(W[i] - \sigma_i X[i]) \\ Z[i+1] &= Z[i] + \sigma_i \tan^{-1} 2^{-i} \end{aligned} \qquad (13)$$

$$\sigma_i = \begin{cases} 1 & \text{if } W[i] > 0, \text{ or } W[i] = 0 \text{ and } i \le n/2 \\ 0 & \text{if } W[i] = 0 \text{ and } i > n/2 \\ -1 & \text{if } W[i] < 0 \end{cases} \qquad (14)$$

Here the correcting stage at the jth step is defined as follows:

$$W^c[j+1] = W[j+1] - 2\,\sigma_j^c X[j+1] \qquad (15)$$
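
For comparison with the rotating case, the conventional vectoring recurrences (11) and selection
function (12) can be sketched in floating-point Python. This is a minimal illustration with a
hypothetical function name `cordic_vector`; it assumes the circular mode, a positive X[0], and
directions restricted to {-1, 1}, so the same constant scale factor K applies. It does not model
the redundant estimate or the correcting stages of CFR-CORDIC.

```python
import math

def cordic_vector(x, y, n=24):
    """Circular non-redundant CORDIC in vectoring mode, per (11)-(12).

    Drives Y toward zero; assumes x > 0 so the angle lies within the
    convergence range. Returns (X[n], Y[n], Z[n]) starting from Z[0] = 0.
    """
    z = 0.0
    for i in range(n):
        sigma = 1 if y >= 0 else -1          # selection function (12)
        shift = 2.0 ** -i                    # a right shift by i bits in hardware
        x, y = x + sigma * shift * y, y - sigma * shift * x
        z += sigma * math.atan(shift)        # accumulate the applied micro-angles
    return x, y, z

# Constant scale factor for sigma_i in {-1, 1} over the same 24 iterations
K = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(24))

xn, yn, zn = cordic_vector(3.0, 4.0)
print(xn / K, zn)   # approximately 5.0 and atan(4/3)
```

The scaled X output gives the magnitude and Z gives the arctangent, matching the second circular
row of Table 1.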

So far, we have discussed the recursive structures of several CORDIC schemes for implementing
the basic PE. The PE, augmented by a translator, necessitates a scaling operation at
each stage, because shuffling of the output at each stage makes the accumulated
scaling factor too complex to be processed at the final stage. The scaling operation
has been solved either explicitly or implicitly. The explicit way divides the
rotated vector by a constant, which is known in advance for the non-redundant scheme and
otherwise has to be calculated while running the micro-steps of CORDIC [12,14]. The division
can be processed by another CORDIC (in linear mode) or a divider. The implicit approach
reconfigures the sequence of micro-iterations of the CORDIC, so that the result has a different
norm from that obtained without scaling micro-iterations. Scaling micro-iterations generally aim
at bringing the adjusted scaling factor to the form $2^i$ or 1, which can easily be set to unit
size. Each micro-iteration can be composed of i) reduction axis-scaling [16], ii) repetition of
vector-scaling, iii) expansion axis-scaling, or combinations thereof. Search methods for
the solution that improve on the greedy method or the decomposed search [18] remain to be
studied. In summary, explicit scaling almost doubles the system complexity, while
implicit scaling increases it by 25% for non-redundant CORDIC and by about 30% for redundant
CORDIC.

3.3  VLSI  Scheme 

To maximize the throughput of the geometric processor, the fully spanned architecture is
selected. The affine transformer is a trivial case, which can be implemented using a single
CORDIC whose micro-iteration is expanded to include an addition. To implement a
spherical transformer, 4 CORDICs are configured: i) a circular square root, $\sqrt{x^2 + y^2}$;
ii) a hyperbolic square root, $\sqrt{r^2 - (\sqrt{x^2 + y^2})^2}$; and iii) two linear divisions
for u and v. To get first estimates of the VLSI size, a typical TV image processing application is
considered: $O(10^5)$ pixels/image addressing and $O(10^{-1})$ sec screen flashing. For this case,
the number of input bits is governed by $\sqrt{\text{pixel number}}$, for which 12 bits are
sufficient. To allow possible interpolations between pixels, $b_i$ is set to 16. Each CORDIC
module requires $(b_i + \log_2 b_i)$ steps of micro-iterations, plus 30% additional iterations
for implicit scaling.

For the spherical transformer, using the fully spanned 4-CORDIC configuration, the number of TRs
is estimated at about 30K (4 × 6K × 1.3).
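
The size estimate can be checked with a few lines of arithmetic. The worked figures below are a
sketch that takes the word length to be 16 bits and reads the parenthesized product as four CORDIC
modules of about 6K TRs each with a 30% overhead for implicit scaling; the per-module figure comes
from that product and is not derived here.

```latex
\begin{aligned}
\text{micro-iterations per CORDIC} &= b_i + \log_2 b_i = 16 + 4 = 20, \\
\text{with implicit scaling} &\approx 1.3 \times 20 = 26, \\
\text{TRs for four CORDICs} &\approx 4 \times 6\text{K} \times 1.3 = 31.2\text{K} \approx 30\text{K}.
\end{aligned}
```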
Abstract- Increased use of on-chip cache memories has led researchers to
investigate their performance in the presence of manufacturing defects. Several
techniques for yield improvement are discussed and results are presented
which indicate that set-associativity may be used to provide defect-tolerance
as well as improve the cache performance. Tradeoffs between several cache
organizations and replacement strategies are investigated and it is shown that
token-based replacement may be a suitable alternative to the widely-used LRU
strategy.

1 Introduction 

The dramatic increase in cache memory size and diminishing geometries have resulted in
lower yields. Today's high performance processors often have on-chip cache and consequently
the yield of these memories can be a significant factor in determining the ultimate
cost of the processor. One way of increasing yields is to provide defect-tolerance through
the use of redundant resources. Two methods for achieving defect-tolerance are commonly
employed in the design of dynamic RAM (DRAM) memories, namely the use of error correcting
codes and spare rows and columns [5]. However, both techniques result in increased
circuitry and possible increases in access times.

Associative memories offer an alternative approach. By design, associative memories
have the flexibility necessary to function in the presence of defects. With the inclusion
of control logic it is possible to force the memory to operate “around” the defect and use
alternative locations, albeit with a reduction in storage capacity. In the following sections
we will describe basic cache memory operation and then discuss the different techniques
for providing defect-tolerance.

2 Cache  Operation 

A cache memory is a fast intermediary memory positioned between a processor and main
storage. The goal of a hierarchical memory system is an average access time close to that
of the cache memory, at a cost per bit approaching that of the main memory. To achieve
the former, the cache must be designed to keep the most frequently referenced items in the
cache. A system may be designed with separate caches for data and instructions or a single
(unified) cache.


2.1  Organization 

A cache memory is organized as sets of blocks, where each block is typically 4 to 16 bytes
of data from main storage. In a direct-mapped cache each set consists of only one block,
whereas in an n-way, set-associative cache each set contains n blocks. The total cache
size is the product of the block size, the number of sets, and the associativity, n.


2.2  Address  Translation 


Address references to the cache are split into three fields, the widths of which depend
upon the cache size and organization. The block field is used to index a particular item
within a block and is $\log_2 b$ bits wide, where b is the number of addressable items within
a block. If s is the number of sets in the cache, then the set field is $\log_2 s$ bits wide and is
used to indicate a particular set for access. The remaining bits are referred to as the tag
and are used to distinguish between other blocks of main memory which may be stored
in the same set. Each block in a set has storage to hold both the block data and the tag
associated with that block. The collection of tag storage for the cache is referred to as the
tag directory. During a memory access, the tag field for the address is compared with all
entries in the tag directory corresponding to the referenced set. If there is a match, the
data from the matched block is sent to the processor. If there is no match, referred to as
a miss, then the missed data must be loaded from main memory.
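
The field split described above can be sketched as follows. This is an illustrative example only:
the function names `split_address` and `sets_in_cache` are hypothetical, word addressing is
assumed, and both the block size and the number of sets are assumed to be powers of two.

```python
def split_address(addr, items_per_block, num_sets):
    """Split a word address into (tag, set index, block offset).

    Both items_per_block (b) and num_sets (s) are assumed to be powers of two.
    """
    block_bits = items_per_block.bit_length() - 1   # log2 b
    set_bits = num_sets.bit_length() - 1            # log2 s
    offset = addr & (items_per_block - 1)           # block field
    set_index = (addr >> block_bits) & (num_sets - 1)
    tag = addr >> (block_bits + set_bits)           # remaining high-order bits
    return tag, set_index, offset

def sets_in_cache(cache_words, block_words, associativity):
    """Cache size = block size x number of sets x associativity."""
    return cache_words // (block_words * associativity)

# 2K-word cache with 8-word blocks: direct-mapped versus 4-way set-associative
print(split_address(0x12345, 8, sets_in_cache(2048, 8, 1)))   # (36, 104, 5)
print(split_address(0x12345, 8, sets_in_cache(2048, 8, 4)))   # (145, 40, 5)
```

For the same capacity, the 4-way organization has a quarter of the sets, so two bits move from the
set field into the tag.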


2.3  Replacement  Policy 


On a miss, the cache must decide where to place the block from main storage which
caused the miss. For a direct-mapped cache the decision is trivial, as each block from
main storage maps to a single block in the cache. However, with a set-associative cache,
assuming the referenced set is full, there are n possible blocks to replace. One of the best
replacement algorithms is referred to as least recently used (LRU), where the set is treated
as a stack and accessing a particular block moves that block to the top of the stack. The
least recently used block is always at the bottom of the stack and a miss will load the
data into this block and move it to the top of the stack. Efficient implementations of the
LRU replacement algorithm require n(n - 1)/2 bits of storage per set to maintain the n!
possible stack configurations. Consequently, a 4-way SA cache requires 6 bits of storage
per set, while an 8-way SA cache requires 28 bits per set. Additional circuitry is needed to
update the stack configuration as a result of an access. Alternative replacement strategies
are first in, first out (FIFO), and random. The FIFO algorithm is implemented using a
modulo n counter for each set, incremented on every miss to that set. One technique for
implementing a pseudo-random replacement strategy is to use a single modulo n counter
for the entire cache and increment it on every miss, regardless of the set. This will be
referred to as token-based replacement.
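
The storage cost of LRU and the two counter-based alternatives can be sketched as below. The class
and function names are hypothetical, and the counter modulus is taken to be the associativity n
(an assumption: the counter has to select one of the n blocks in a set).

```python
def lru_bits_per_set(n):
    """Bits per set for the n(n - 1)/2 pairwise-order LRU implementation."""
    return n * (n - 1) // 2

class FifoReplacer:
    """One modulo-n counter per set, advanced on each miss to that set."""
    def __init__(self, associativity, num_sets):
        self.n = associativity
        self.next_way = [0] * num_sets

    def victim(self, set_index):
        way = self.next_way[set_index]
        self.next_way[set_index] = (way + 1) % self.n
        return way

class TokenReplacer:
    """A single modulo-n counter for the whole cache, advanced on every miss."""
    def __init__(self, associativity):
        self.n = associativity
        self.token = 0

    def victim(self):
        way = self.token
        self.token = (self.token + 1) % self.n
        return way

print(lru_bits_per_set(4), lru_bits_per_set(8))   # 6 28
```

The token scheme needs only one small counter for the entire cache, which is why it is so much
cheaper than LRU state kept for every set.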


2.4  Discussion

Several observations may be made in comparing direct-mapped caches to set-associative
caches. First, for a given cache size, the tag field and consequently the tag directory will
be larger for the SA cache. This is because the n-way, set-associative cache will have 1/n
the number of sets as the direct-mapped cache, needing fewer bits in the set field, and
increasing the number of bits in the tag field. Second, for the set-associative cache, n
comparisons must be conducted in parallel between the tag field and the entries in the tag
directory. Furthermore, the set-associative cache has an additional delay over the
direct-mapped cache as a result of the need to multiplex the data from each of the blocks in
the referenced set to the output. Lastly, the SA cache has additional circuitry needed to
implement the replacement algorithm.


3 Defect-Tolerance  Strategies

3.1  Spare  Resources 

There  are  several  methods  for  implementing  memory  reconfiguration  in  the  presence  of 
defects:  electrically  programmable  links,  electron-beam  programmable  fuses,  and  laser 
cutting/welding  [6].  These  techniques  can  be  employed  to  bypass  faulty  resources  and 
activate  spare  units.  The  most  common  technique  for  increasing  memory  yields  is  to  include 
spare  rows  and/or  columns  in  the  data  array  and  sufficient  programmable  decoders.  While 
all  implementations  increase  the  circuit  area,  some  methods  may  also  increase  the  access 
times  and  power  dissipation  [5].  Furthermore,  unless  special  circuitry  is  added  it  is  usually 
not  possible  to  test  the  spare  rows/columns  without  first  programming  the  decoders. 

It  has  recently  been  observed  that  manufacturing  “throughput”,  measured  in  usable
chips  per  unit  time,  is  dominated  by  the  delay  associated  with  repairing  defective  parts 
rather  than  the  process  yield  [2].  These  researchers  argue  that  efforts  should  be  directed 
at  maximizing  the  throughput,  rather  than  the  yield,  and  propose  algorithms  for  achieving 
this  by  balancing  repair  time  and  yield  of  repaired  parts.  Previously,  production  experience 
with  a 64K  DRAM  indicated  that  the  repair  algorithm  typically  took  several  seconds  and 
represented  roughly  half  of  the  entire  test  time  [10].  The  next  two  sections  describe  methods 
which  eliminate  the  time  needed  to  execute  a repair  algorithm. 


3.2  Error  Correcting  Codes 

Error  correcting  codes  can  be  used  to  correct  single  or  multiple  errors  in  the  tag  directory 
and  data  array  caused  by  manufacturing  defects.  Codes  may  be  selected  to  provide  a 
guaranteed  level  of  protection  at  a corresponding  increase  in  circuit  area  and  access  time. 
A 16-bit  word  would  require  6 extra  check  bits  to  detect  and  correct  all  single  errors. 
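
The check-bit count can be reproduced from the standard Hamming bound, which requires
$2^r \ge m + r + 1$ for r check bits protecting m data bits. The sketch below uses a hypothetical
function name and assumes a plain Hamming code; under that bound a 16-bit word needs 5 check bits
for single-error correction alone and 6 once an overall parity bit is added for double-error
detection, so the figure of 6 quoted above corresponds to the latter.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

for m in (16, 32):
    sec = hamming_check_bits(m)
    # columns: data bits, SEC check bits, SEC-DED check bits, SEC storage overhead
    print(m, sec, sec + 1, f"{100 * sec / m:.1f}%")
```

The relative overhead shrinks as the word grows, which is the sense in which check bits are most
efficient for large words.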

In addition to storing the check bits, additional circuitry is needed to encode or decode
during memory accesses. For large words, where the use of check bits is most efficient,
the delay associated with this circuitry can be significant. Results of a timing analysis are
presented in [11] which indicate a 20% increase in access time using single error correction,
double error detection coding of the tag directory and data array. For these reasons, error
correcting coding is generally reserved for applications which require tolerance of transient
errors incurred during normal operation. However, Mostek built a 1-Mbit ROM with a
32-bit word that achieved a 3-fold improvement in yield at a 20% increase in area [8].

A distinct advantage of error correcting coding is the lack of any “repair” time. As mentioned
earlier, the delay associated with this process can severely affect the manufacturing
throughput for the part.

3.3  Associativity 

Sohi observed that a cache memory does not have to be defect free to meet its objective,
namely to reduce the average memory access time of a hierarchical memory system [11].
A direct-mapped cache memory with a defective block will never be able to hold items
from main memory which map to that set in the cache. For a cache to operate properly
under this condition two things are necessary: one, the cache must be able to recognize
a defective block and generate a miss, and two, it must have the capability of performing a
load through, so that the processor can access the item. An associative cache has alternate
locations within a set which can be used when there is a defective block present. Ideally,
the circuitry which implements the replacement algorithm would be modified at test time
to exclude defective blocks from selection during replacement. Provided each set has at
least one good block, all items from main memory can map to a good location in the cache.

4 Related  Work 

Patterson  et  al.  described  the  implementation  of  a cache  memory  in  which  each  cache  block 
was  provided  with  a fault  tolerant  bit,  which  could  permanently  invalidate  a cache  block. 
Set-associativity  was  achieved  through  the  use  of  multiple  chips  and  block  replacement 
was  directed  by  a token  [7].  Accessing  a bad  block  would  result  in  a miss. 

More recently, Bergh et al. designed a fully associative fault-tolerant memory. Extra
logic, amounting to a 2% increase in area, allowed the memory to completely bypass
defective locations transparent to the user [1].

Finally, Sohi investigated the performance under defects, as measured by miss ratio,
of three different cache organizations: direct-mapped, 2-way set-associative, and
fully-associative [11]. His research illustrated that it is possible for a 2-way set-associative
cache, using an LRU replacement strategy, to outperform a direct-mapped cache of equivalent size
in the presence of defects.

This paper attempts to extend the work of Sohi in evaluating the benefits of associativity
for the purpose of defect-tolerance. In this paper I focus upon set-associative cache
memories for the following reasons:

• fully-associative caches are generally not required for many applications and are
prohibitively expensive;

• set-associative caches possess the flexibility necessary to reduce the impact of defects.

Specifically, this paper investigates set-associative caches of various organizations under
three different replacement strategies, least recently used (LRU), first-in, first-out (FIFO)
and token-based. The LRU strategy is widely accepted as the superior strategy, although
costlier to implement [9].

              % Change in Hits Compared to 2K, DM Cache

Size                    2-way, SA                4-way, SA
(words)    DM      LRU    FIFO   Token      LRU    FIFO   Token
2K         0       2.2    1.7    1.8        3.5    2.8    2.9
4K         5.0     7.6    7.2    7.2        8.8    8.2    8.2
8K         9.6     12.0   11.7   11.8       13.1   12.6   12.7

Table 1: Performance of Defect-free Caches


5 Simulation  Methods 

Performance evaluation was conducted using address trace simulation. The address traces
were generated from runs of SPICE, gcc, and TeX, for a total of over 2.8 million references,
approximately 75% of which were instruction references [3]. All address references were
assumed to reference items of the same size, namely one word. A wide variety of caches
were studied; however, in all cases the block size was held at 8 words and the cache was
treated as a unified cache (instructions and data).

Three different cache sizes were simulated, ranging in size from 2K words to 8K words.
For each size, three different cache structures were investigated: direct-mapped, 2-way
set-associative, and 4-way set-associative. Each associative cache was simulated using three
different replacement strategies: least recently used (LRU), first in, first out (FIFO), and
token-based. Lastly, each associative cache was simulated under three different levels of
defects, ranging from zero to 25%.

During  defect  simulation  each  cache  was  simulated  forty  times,  each  iteration  using  a 
random  distribution  of  defects.  Furthermore,  defect-levels  were  limited  and  the  defects 
distributed  such  that  each  set  was  guaranteed  to  have  at  least  one  good  block.  The 
replacement  strategies  were  modified  from  the  traditional  descriptions  to  prevent  loading 
a missed  block  into  a defective  location. 
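
One simple way to set up such a defect simulation is sketched below. It is not the simulator used
for the results in this paper: the function names `place_defects` and `next_good_way` are
hypothetical, defects are placed at block granularity, and the placement loop merely guarantees
that every set keeps at least one good block without claiming any particular statistical
distribution.

```python
import random

def place_defects(num_sets, assoc, defect_level, rng):
    """Return a per-set list of flags (True = defective block).

    Places round(num_sets * assoc * defect_level) defects at random while
    leaving at least one good block in every set.
    """
    total_defects = round(num_sets * assoc * defect_level)
    if total_defects > num_sets * (assoc - 1):
        raise ValueError("defect level leaves some set without a good block")
    defective = [[False] * assoc for _ in range(num_sets)]
    blocks = [(s, w) for s in range(num_sets) for w in range(assoc)]
    rng.shuffle(blocks)
    placed = 0
    for s, w in blocks:
        if placed == total_defects:
            break
        if sum(defective[s]) < assoc - 1:    # keep one good block in the set
            defective[s][w] = True
            placed += 1
    return defective

def next_good_way(start, defective_ways):
    """Advance from the way a replacement policy chose to the next good block."""
    n = len(defective_ways)
    for step in range(n):
        way = (start + step) % n
        if not defective_ways[way]:
            return way
    raise ValueError("set has no good block")

defects = place_defects(num_sets=64, assoc=4, defect_level=0.25, rng=random.Random(0))
print(sum(sum(ways) for ways in defects))   # 64 defective blocks out of 256
```

Running the trace once per random placement and averaging the hit counts would mirror the repeated
trials described above; `next_good_way` shows one possible modification that keeps any of the
three replacement strategies from loading a missed block into a defective location.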


6 Results 

6.1  Defect-free  Performance 

         % Change in Hits Compared to 2K, DM Cache

Size               2-way, SA                4-way, SA
(words)       LRU    FIFO   Token      LRU    FIFO   Token
2K            -1.0   -1.4   -1.6       1.9    1.2    1.3
4K            5.7    5.4    5.2        7.5    6.9    6.8
8K            10.3   10.1   9.9        12.0   11.4   11.5

Table 2: Performance with 12.5% Defect-Level

Table 1 shows the percent change in the total number of hits for various cache organizations,
relative to the total number of hits for a 2K, direct-mapped (DM) cache. From this data we
can make several observations regarding the relative performances of various organizations
under defect-free operation:


• As  cache  size  doubled  there  was  approximately  a 5%  increase  in  hits,  relative  to 
a 2K,  DM  cache,  across  all  structures  and  replacement  algorithms.  However,  this 
effect  would  eventually  diminish  as  the  cache  size  approached  that  of  the  workload’s 
working  set. 

• The  token-based  replacement  algorithm  was  virtually  identical  in  performance  to  the 
FIFO  algorithm  for  all  cache  organizations.  While  at  first  this  may  seem  surprising, 
neither  algorithm  is  a “usage  based”  algorithm  and  consequently  their  performance 
is  roughly  equivalent. 

• LRU was the best replacement strategy, increasing the performance by roughly 0.5%
over the other algorithms. For a fixed cache size, the performance difference increased
with associativity. As the number of blocks per set increased, LRU’s superior management
of those resources became more apparent. For a fixed associativity, n, the
improvement decreased with increasing cache size. This may be attributed to reduced
contention in the cache.

• Doubling the associativity increased performance by approximately 2%, with the
improvement diminishing as the associativity increased. Again, this may be attributed
to reduced contention within the cache.

As cache size increases, there is less contention for space in the cache and performance
differences due to associativity and replacement strategies tend to diminish. The same is
true for a fixed cache size as associativity increases, particularly under LRU replacement.


6.2  Performance  under  Defects 

Tables 2 and 3 detail the results of simulating various cache organizations under two
different defect-levels. As in the first table, the numbers represent the percent change in
the total number of hits, relative to a defect-free, 2K, DM cache. At the 12.5% defect-level,
there was a drop in performance that was a function of both cache size and associativity,
but not replacement strategy. For example, all 2K, 4-way, SA caches experienced a decline
of approximately 1.6% from their defect-free performance. This can be attributed to the
fact that each replacement algorithm was modified such that missed data would never be
loaded into a defective block. The average number of bad blocks per set is equal to the
product of the associativity and the defect-level. So at a 12.5% defect-level a 2-way cache
will have, on average, 0.25 bad blocks per set, or one bad block for every four sets, while
a 4-way cache will average one bad block for every other set. Consequently, the effect of
defects is to decrease the associativity. At low defect-levels, and particularly for low values
of associativity, the decrease will be minor and thus the performance differences between
replacement algorithms will remain approximately constant.

         % Change in Hits Compared to 2K, DM Cache

Size               2-way, SA                4-way, SA
(words)       LRU    FIFO   Token      LRU    FIFO   Token
2K            -4.1   -4.4   -4.7       0.2    -0.5   -0.4
4K            3.8    3.6    3.4        6.0    5.4    5.3
8K            8.5    8.4    8.0        10.7   10.2   10.1

Table 3: Performance with 25% Defect-Level
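
The arithmetic behind these statements can be checked directly from the tabulated values. The
snippet below is a small sketch: the 2K rows are copied from Tables 1 and 2, and the variable and
function names are chosen here for illustration.

```python
# Percent change in hits for the 2K caches, copied from Tables 1 and 2
# (order: 2-way LRU, FIFO, Token, then 4-way LRU, FIFO, Token).
defect_free_2k = [2.2, 1.7, 1.8, 3.5, 2.8, 2.9]
defect_12_5_2k = [-1.0, -1.4, -1.6, 1.9, 1.2, 1.3]

# Drop in percentage points when moving from defect-free to 12.5% defects
drops = [round(a - b, 1) for a, b in zip(defect_free_2k, defect_12_5_2k)]
print(drops)

def bad_blocks_per_set(associativity, defect_level):
    """Average number of defective blocks per set."""
    return associativity * defect_level

print(bad_blocks_per_set(2, 0.125), bad_blocks_per_set(4, 0.125))
```

The last three entries of `drops` are the 4-way caches, and each shows the same decline whatever
the replacement strategy, while the 2-way caches lose roughly twice as much.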

In general, the larger the associativity or the total cache size, the smaller the drop in
performance due to defects. Increasing associativity or size are two methods for reducing
contention in a cache and consequently it is expected that defects would have a lesser effect
on these caches. Another important observation is that all 4-way, SA, 2K caches, regardless
of replacement algorithm, outperformed the defect-free, DM, 2K cache. Furthermore,
for caches larger than 2K, all associative caches with a 12.5% defect-level outperformed a
defect-free DM cache of equivalent size. This is a clear example of using associativity to
provide defect-tolerance and a performance improvement. At a defect-level of 25%, only
the 4-way, set-associative caches outperformed the defect-free, DM caches.

Other researchers have suggested that the use of associative cache memory may be on
the decline because as cache memories increase in size the performance difference between
direct-mapped and set-associative will decrease [4]. Furthermore, a DM cache is always
smaller and faster than an SA cache of equivalent capacity, due to the extra circuitry required
to implement the associativity. From our limited trials it is difficult to validate such a trend
in performance. An 8K, DM cache had 9.6% more hits than a 2K, DM cache, whereas the
2-way, SA cache had 12% more and the 4-way had 13% more. These differences are similar
to the differences observed for 2K caches. Of course, common sense dictates that as the
cache size approaches the size of the working set the differences will diminish. While this
may occur soon for board level cache memories, the author suspects that on-chip cache
will continue to benefit from the use of associativity, due to size limitations. Doubling the
associativity and halving the number of sets requires less area than doubling the cache
capacity.

7 Summary 

The results indicate that a set-associative cache can experience a significant number of
defects and still exceed the performance of a direct-mapped cache of equivalent capacity.
Secondly, although the LRU replacement strategy performed better than FIFO or token-based
replacement, the modest improvement, particularly at lower associativities and large
cache sizes, may not warrant the increase in control logic.

The fundamental question is: “Should associativity be used to increase manufacturing
yields instead of spare rows and columns?” To answer this, one needs to develop a cost
function capable of reflecting the impact of manufacturing throughput, circuit characteristics
(power, size, speed), and cache performance as measured by miss ratio. Several
observations may be made:

• If the cache access time is critical and the application cannot tolerate the additional
delay imposed by associativity then spare rows and columns are the only alternative
for increasing yields.

• If  the  chosen  technology  has  matured  to  the  point  where  manufacturing  throughput 
is  not  severely  affected  by  the  time  needed  to  repair  devices,  then  spare  rows  and 
columns  are  probably  the  logical  selection.  A repaired  part  will  be  guaranteed  to 
have  a full  set  of  defect-free  blocks  and  will  have  known  performance  characteristics. 

• If, on the other hand, manufacturing throughput is poor, due either to low yields
or lengthy repair times, then using associativity may be a viable alternative to using
spare rows and columns. By doubling the associativity and halving the number of
sets, cache performance can be improved even in the presence of defective blocks.
Repair time will be minimal and simply involve marking defective blocks as unusable.
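The trade-off in the observations above can be illustrated with a small simulation. The following is a minimal sketch, not the simulator used for the reported results: it assumes LRU replacement, block-granular addresses, and a hypothetical `defective` set of (set index, way index) pairs that are simply never allocated. The class and function names are illustrative only.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """LRU set-associative cache in which defective blocks are never used."""

    def __init__(self, num_sets, ways, defective=()):
        self.num_sets = num_sets
        defective = set(defective)  # {(set_index, way_index), ...}
        # Usable capacity of each set once defective ways are marked unusable.
        self.capacity = [
            ways - sum((s, w) in defective for w in range(ways))
            for s in range(num_sets)
        ]
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, block_addr):
        """Access one block address; return True on a hit."""
        s = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        lines = self.sets[s]
        self.accesses += 1
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if self.capacity[s] == 0:
            return False                # every way in this set is defective
        if len(lines) >= self.capacity[s]:
            lines.popitem(last=False)   # evict the least recently used block
        lines[tag] = True
        return False


def hit_count(cache, trace):
    """Run a list of block addresses through the cache and return the hits."""
    for addr in trace:
        cache.access(addr)
    return cache.hits


if __name__ == "__main__":
    trace = [0, 8] * 4                  # two blocks that conflict when direct mapped
    print(hit_count(SetAssociativeCache(8, 1), trace))                      # direct mapped
    print(hit_count(SetAssociativeCache(4, 2, defective={(1, 0)}), trace))  # 2-way, one defect
```

Both configurations hold eight blocks nominally. On this deliberately conflicting trace the 2-way cache keeps both blocks resident even with a defective block elsewhere, while a defect in the conflicting set itself reduces that set to direct-mapped behavior.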

Research  is  being  considered  to  evaluate  the  area  overhead  associated  with  enhancing 
the  replacement  algorithms  to  avoid  defective  blocks. 

Perhaps the biggest deterrent to using this approach may be the difficulty in marketing
such a device. Customers expect devices to be 100% defect-free and might be unwilling to
order parts which are guaranteed to have “a maximum defect-level,” particularly as two
devices with the same defect-level will not perform identically on the same workload.
SPROC - A Multiple-Processor DSP IC

R.  Davis 

Hewlett-Packard  ICBD 
Corvallis,  OR 

Abstract-  A large  single-chip  multiple-processor  digital  signal  processing  IC  fab- 
ricated in  HP-Cmos34  is  presented.  The  innovative  architecture  is  best  suited 
for  analog  and  real-time  systems  characterized  by  both  parallel  signal  data 
flows  and  concurrent  logic  processing.  The  IC  is  supported  by  a powerful  devel- 
opment system  that  transforms  graphical  signal  flow  graphs  into  production- 
ready  systems  in  minutes.  Automatic  compiler  partitioning  of  tasks  among 
four  on-chip  processors  gives  the  IC  the  signal  processing  power  of  several 
conventional  DSP  chips. 


1 Introduction 

Digital  signal  processing  (DSP)  involves  the  real-time  acquisition  of  analog  (continuous) 
inputs,  their  analysis  and  processing  in  a digital  system,  and  subsequent  synthesis  and 
reintroduction  back  to  the  analog  domain. 

Conventional DSP chips are tuned for fast multiply and multiply-and-accumulate (MAC)
algorithms on serial data streams such as required for filtering and spectral analysis. These
algorithms  take  the  ubiquitous  form 


$$y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i) + \sum_{k=1}^{M} b(k)\, y(n-k)$$

that  compute  outputs  as  weighted  sums  of  present  and  past  inputs,  and  past  outputs. 
However,  many  analog  and  real-time  systems  are  better  characterized  by  complex  networks 
of  parallel,  and  often  asynchronous,  data  flows  and  concurrent  logic  processing.  Program- 
ming a conventional  DSP  chip  to  perform  fundamental  scheduling  and  synchronization 
tasks can become intractable.
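The weighted-sum form that conventional DSP chips are tuned for can be written as a short multiply-and-accumulate loop. This is a minimal sketch under stated assumptions: the input sum starts at the present sample (index 0), the feedback coefficients are named `b` (a name chosen here for illustration), and samples before n = 0 are taken as zero.

```python
def difference_equation(x, a, b):
    """Compute y(n) = sum_i a[i]*x(n-i) + sum_k b[k-1]*y(n-k) sample by sample.

    a[i] weights the present (i = 0) and past inputs; b[k-1] weights the
    output k samples ago (k = 1..M). Samples before n = 0 are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, a_i in enumerate(a):            # feed-forward MAC operations
            if n - i >= 0:
                acc += a_i * x[n - i]
        for k, b_k in enumerate(b, start=1):   # feedback MAC operations
            if n - k >= 0:
                acc += b_k * y[n - k]
        y.append(acc)
    return y


if __name__ == "__main__":
    # Two-tap moving average (no feedback), then a one-pole recursive filter.
    print(difference_equation([1.0, 1.0, 1.0], a=[0.5, 0.5], b=[]))
    print(difference_equation([1.0, 0.0, 0.0, 0.0], a=[1.0], b=[0.5]))
```

Every output sample costs one MAC per coefficient, which is why a fast MAC instruction dominates conventional DSP design, and why such a loop says nothing about scheduling several asynchronous data flows.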

SPROC, an IC and development system, efficiently manages concurrency through
the  use  of  dedicated  control  circuitry  and  a powerful  compiler  that  automatically  and 
transparently  partitions  tasks  among  several  processors.  It  minimizes  the  number  of  com- 
ponents for  simple  systems,  yet  remains  largely  extensible  for  arbitrarily  complex  designs; 
it  is  easier  to  program  with  its  library  of  customizable  building  blocks;  it  is  easier  to 
debug  with  its  built-in  real-time  probe;  it  facilitates  both  rapid  prototyping  and  produc- 
tion development  on  one  system.  It  features  full  24-bit  fixed-point  precision  with  56-bit 
accumulation  resulting  in  a 144dB  dynamic  range  for  signal  bandwidths  up  to  250  kHz 
and  handles  all  signal  scaling  automatically.  The  chip  can  be  dynamically  reprogrammed, 
making adaptive, self-calibrating, and field upgradeable systems easier to design. The
parallel port supports Motorola and Intel microprocessor interface protocols. The IC can be
ganged  to  implement  arbitrarily  complex  systems. 

2 Chip  Programming  and  Development  Cycle 

First,  a signal-flow  diagram  of  the  desired  system  is  graphically  captured  by  selecting, 
placing,  interconnecting,  and  parameterizing  standard  or  customized  function  blocks,  such 
as  signal  generators,  summers,  filters,  etc.  Next,  the  compiler  converts  the  signal-flow 
diagram  into  executable  code,  allocating  tasks  efficiently  between  the  available  processors, 
building  symbol  tables  for  simple  interfacing  to  the  code.  Then,  the  code  is  downloaded 
to  the  SPROC  chip  via  either  the  development  or  target  system.  Finally,  while  the  code 
is  executing,  circuit  nodes  can  be  probed,  parameters  can  be  modified,  and  the  system 
observed  in  real  time. 

The  SPROC  advantages  are  fundamental:  more  complex,  analog  and  real-time  ap- 
plications can  be  realized  in  a fraction  of  the  time;  designs  can  be  observed  in  real-time 
and modified on-the-fly; any design that can be compiled is guaranteed to run on the
SPROC chip. Higher designer productivity and improved performance translate into
short time-to-market of more creative and competitive systems.


3 Chip  Architecture 

A Harvard  architecture  employing  separate  program  and  data  busses  allows  concurrency 
in  instruction  fetch,  decode,  execution  and  data  manipulation.  The  major  blocks  are  the 
general  signal  processor  (GSP),  parallel  interface  (HOST),  a serial  interface  (ACCESS), 
serial  interfaces  for  sampled  data  (serial  PORTS),  a DAC  port,  a glue  block  (GLUE), 
and  memory.  An  overview  of  the  system  architecture  is  shown  in  Figure  1. 

SPROC  operates  in  various  configurations  and  modes.  In  Master  mode,  the  system 
boots  from  external  EPROM.  In  Slave  mode,  SPROC  responds  to  an  external  controller 
which  is  either  a microprocessor  or  a master  SPROC.  In  Redundancy  mode,  the  GSPs 
perform a system self-test, attempt redundancy, and reconfigure the system. Thus, while
the  chip  is  highly  integrated,  it  is  flexible  and  extensible. 

3.1  GSP 

Each  GSP  is  a 24-bit  digital  processor  with  64  instructions  and  eight  addressing  modes. 
Main  blocks  include  program  control,  address  generator,  multiplier,  ALU,  and  decoder. 
Instructions  include  multiply  (MPY)  and  multiply-and-accumulate  (MAC)  that  execute  in 
fifteen clock periods. One of up to four GSPs controls both program and memory busses
on a time-multiplexed basis. As triggered, a time slice for I/O operations via HOST,
ACCESS, PORTS, or probing DAC is interjected (see Figure 2).

Figure 1: System Architecture

Figure 2: System Timing (P = Program Bus Access; D = Data Bus Access; I/O = HOST,
ACCESS, PORTS, or probing DAC Access)

In  Redundancy  mode,  each  GSP  executes  a self-test  code  from  internal  ROM  upon 
power-up.  If  defective,  the  GSP  is  essentially  held  in  reset  and  removed  from  tasking 
operations. This enables otherwise functional parts to yield at wafer test and provides
fault-tolerance  in  the  field.  The  fault  coverage  of  this  test  is  approximately  70%. 

3.2  HOST 

The  host  interface  (HOST)  is  a 24-bit  asynchronous  bidirectional  parallel  port  with  a 64K 
addressing  range,  and  supports  8,  16,  and  24  bit  transfers.  It  typically  interfaces  to  the 
digital  subsystem  of  the  target  environment.  The  GSPs  can  access  the  HOST  via  LOAD 
and  STORE  instructions.  Internally,  SPROC  has  a 12-bit  addressing  range  with  4 bits 
reserved  for  master  to  slave  addressing  for  memory-mapped  devices  or  ganged  SPROCs. 

3.3  ACCESS 

The  access  port  (ACCESS)  is  a two  port  serial  interface.  It  is  typically  used  to  observe 
and  modify  the  contents  of  internal  memory  while  the  system  is  operating.  The  input  port 
requires  data,  clock,  and  strobe;  the  output  port  drives  a strobe  and  data  based  on  the 
input  port  clock  rate.  Access  is  time  multiplexed  and  is  transparent  to  internal  operations. 
Full  read/write  access  is  provided  to  any  valid  SPROC  address. 

3.4  PORTS 

The  sampled  data  streams  are  supported  by  four  serial  ports  configurable  for  data,  clock, 
strobe,  and  sync.  There  are  two  input  and  two  output  ports  available.  A data  flow  manager 
(DFM)  manages  the  concurrency  of  multiple  GSP  and  data  RAM  accesses.  Very  simply, 
an  input  DFM  writes  input  sample  data  to  consecutive  data  RAM  locations  and  updates  a 
write  pointer.  An  output  DFM  will  subsequently  fetch  output  sample  data  from  the  data 
RAM. 


3.5  GLUE 

The  glue  block  (GLUE)  provides  address  decoding  and  memory  mapping,  mode  control, 
system  cycle  generation,  and  serial  port  timing. 

3.6  DAC 

The digital-to-analog port (DAC) allows the probing of any node on the signal-flow
diagram. These nodes are represented internally as two’s complement FIFO buffers in data
RAM. Hence, a node can be selected to direct its data buffer to the on-chip DAC port,
and the analog value can be observed in real-time. An internal gain register can be loaded
to scale the digital value before outputting. The corresponding analog voltage is buffered
and  driven  off  chip,  and  may  be  observed  with  an  oscilloscope,  spectrum  analyzer,  etc. 


4 IC  Design  Methodology 

4.1  Partitioning 

Star Semiconductor approached HP with a prototype system breadboarded with off-the-shelf
memory and Xilinx and Actel field-programmable gate array logic and a desire for
fast, integrated silicon. Chip development on the customer side was primarily in Cadence,
with VERILOG providing functional, behavioral, and logic simulation of the system and
VERIFAULT for fault analysis. TA, a static timing analyzer, was used for detailed timing
optimization. 

HP  recommended  developing  additional  standard  cells  including  a recirculating  flip- 
flop,  adder,  and  lookahead  cells  to  complement  its  standard  cell  offering  HP-Cmos34. 
This  resulted  in  enhanced  performance,  less  silicon  area,  and  a more  direct  mapping  of  the 
netlist. We also developed the memories, DAC, and OSC, and took on the task of global
composition and verification. Critical paths were simulated in SPICE, and capacitance was fed back
to  the  customer  for  final  timing  simulations.  Clock,  power,  and  analog  routing  required 
manual  editing. 

4.2  New  Standard  Cell  Development 

Realizing  the  prevalent  use  of  recirculating  registers  led  to  the  incorporation  of  2,  3,  and 
4-way multiplexers into the flip-flop to minimize area (see Table 1).

Table 1: Comparison of flip-flops, multiplexer combinations

Cell     Library     Width (µm)   Intrinsic Delay (ns)   Load Multiplier (ns/pF)
DFFB     Standard    54.6         7.8                    3.4
DFFF     Standard    121.8        2.6                    1.5
X1RG1    New-Std     46.2         1.9                    2.1
MUX2B    Standard    37.8         2.9                    4.8
XMUX2    New-Std     33.6         1.8                    1.3
X2RG1    New-Std     71.4         1.9                    2.2
Also, adder cells were developed including a slow 1-bit adder for the multiplier, a fast
4-bit adder, and a 4-bit carry lookahead for the address logic (see Table 2).


Table 2: Adder cells

Cell       Library     Width (µm)   Intrinsic Delay (ns)   Load Multiplier (ns/pF)
XADD1B     New-Std     63.8         4.2                    2.4
XADD4      New-Std     226.8        1.8                    3.6
XLOOK4H    New-Std     189.0        1.6                    2.9
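The two delay columns in the cell tables suggest a simple linear delay estimate. The sketch below assumes the common interpretation that a cell's delay is its intrinsic delay plus the load multiplier times the capacitive load it drives. That interpretation, the 0.5 pF example load, and the function name are assumptions made here for illustration; the numbers themselves are copied from the tables.

```python
# (intrinsic delay in ns, load multiplier in ns/pF), copied from the cell tables.
CELLS = {
    "MUX2B": (2.9, 4.8),   # standard library 2-way multiplexer
    "XMUX2": (1.8, 1.3),   # new standard cell multiplexer
    "XADD4": (1.8, 3.6),   # new 4-bit adder
}


def cell_delay_ns(cell, load_pf):
    """Estimate delay as intrinsic + multiplier * load (assumed linear model)."""
    intrinsic_ns, ns_per_pf = CELLS[cell]
    return intrinsic_ns + ns_per_pf * load_pf


if __name__ == "__main__":
    for name in CELLS:
        print(name, round(cell_delay_ns(name, 0.5), 2), "ns at 0.5 pF")
```

Under this model the new multiplexer cell is faster than the standard one at any load, since both its intrinsic delay and its load multiplier are smaller.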


with  the  autorouting  (HARP)  tools  has  been  developed.  First,  blocks  are  routed  with 
random  port  locations  to  determine  size.  Then,  blocks  are  re-routed  with  assigned  port 
locations  determined  by  the  floorplan.  Finally,  the  top  level  is  routed  with  the  pads. 
Developing  the  SPROC  chip  produced  some  enhancements  to  the  process. 


5.1  Routing  Tricks 

Initial  block  sizes  were  estimated  using  the  csize  program  (which  counts  cells  and  adds 
their  areas)  with  estimates  for  routing  overhead.  Port  locations  were  assigned  manually 
taking  into  account  the  initial  floorplan  and  stored  in  a file  for  repeated  runs  and  easy 
modification;  random  assignments  were  only  made  if  a block  had  no  assignment  file.  After 
iteratively  routing  to  reach  an  optimal  block  size,  a frame  was  extracted  and  placed  in  a 
dummy  BDL  file,  which  was  then  combined  with  custom  frames  for  global  routing  including 
pads. 

The  new  approach  had  the  major  advantage  of  flexibility  of  accepting  new  netlists  from 
the  designers  and  in  experimenting  with  different  partitions  and  floorplans  in  short  order. 
Any  piece  could  be  easily  rerouted  and  incorporated  as  desired,  including  the  global  route. 

It was a must that each of the GSPs have optimal and identical performance, yet
floorplan well. To accomplish this, ports were duplicated on each side of the block,
and  the  blocks  mirrored  and  routed  back-to-back.  To  reduce  the  global  routing,  the  block 
consisting  of  two  GSPs  only  had  one  set  of  ports. 

Routing  ALLPORTS,  INTERFACE,  and  GLUE  as  a single  HARP  block  caused  a 
great  dispersal  of  the  major  busses.  Partitioning  these  blocks  and  ports  next  to  a central 
bussing  channel  proved  to  be  more  successful. 
5.2 Routing Traps

Global  power  routing  was  problematic.  Power  estimates  were  determined  by  SPICE  and 
the  logic  simulators.  A package  was  selected  to  provide  several  power  pads  on  each  side. 
This  required  additional  HARP  modification.  Also,  end  cap  cells  were  modified  to  supply 
both  power  supplies  to  either  end  of  the  blocks,  reducing  IR  drops  by  a factor  of  two. 
HARP  was  given  parameters  to  increase  the  sizing  of  power  busses  between  the  blocks, 
each  of  which  had  multiple  power  ports.  Manual  editing  was  required  to  tie  major  power 
straps  together,  which  run  in  pairs  throughout  the  chip.  The  analog  section  was  isolated 
by breaking the pad ring and connecting it to dedicated power pads. Also, digital signal
lines were manually rerouted to avoid crossing the analog logic.

Long global bussing of minimum width clock lines proved to have unacceptable RC wiring
delays  after  final  routing.  The  clock  tree  had  to  be  resimulated  taking  these  additional 
delays  into  account.  To  minimize  skew,  the  clock  drivers  had  been  placed  in  the  GLUE 
block,  with  the  clock  ports  dispersed  along  one  edge.  The  lines  were  selectively  widened 
to  a full  contact  width  without  penalty.  It  was  sometimes  possible  to  double  the  width 
of a single line if the vias on adjacent lines were coincident, or to drop the metal layers
in  parallel  over  long  isolated  runs.  The  clock  network  was  reduced  to  a clock  grid  by 
effectively  shorting  the  clock  branches  back  together  at  the  top  level. 

6 Custom  Modules 

6.1  RAM 

The data and program memories are identical 1K word by 24-bit six-transistor static
RAMs. A custom RAM was leveraged to improve the performance, as well as reduce
area, with respect to an available RAM generator. The single-core array was developed
for simplicity as 128 rows of 192 six-transistor static RAM columns. An 8-to-1 column
multiplexer feeds a passive sense inverter and non-inverting tristate output buffer to achieve
a 16ns cycle time in an area less than 10mm². About 80% of the area is consumed by the
core array. A dual clocking mode for precharge was adopted. In half-cycle mode, the timing
is determined by two edges of the system clock up to 40MHz. In internal clock mode, an
inverter delay chain times the precharge against one edge of a clock up to 50MHz (4.75V,
85°C). With a 20ns cycle boundary, the address generation gate delays, wiring delays, and
clock skew must be less than 4ns for 50MHz operation. Both RAMs are accessed every
clock cycle and consume approximately 600mW each.
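The array organization and the timing budget quoted for the RAM can be checked with a few lines of arithmetic. This is only a consistency check on the figures stated in the text; the function names are illustrative.

```python
def ram_organization(words, word_bits, rows, columns):
    """Return the column-multiplexer ratio for a words x word_bits array.

    Raises ValueError if rows x columns does not hold exactly the stated capacity.
    """
    if rows * columns != words * word_bits:
        raise ValueError("array does not match the stated capacity")
    return columns // word_bits   # physical columns sharing one output bit


def timing_margin_ns(clock_mhz, ram_cycle_ns):
    """Time left in one clock period after the RAM cycle itself."""
    return 1000.0 / clock_mhz - ram_cycle_ns


if __name__ == "__main__":
    # 1K x 24 bits stored as 128 rows of 192 columns.
    print(ram_organization(1024, 24, 128, 192))   # column mux ratio
    print(timing_margin_ns(50, 16))               # margin in ns at 50 MHz
```

The 128 x 192 array holds 24,576 bits, which matches 1K words of 24 bits and implies the 8-to-1 column multiplexer. At 50MHz the 20ns period leaves 4ns beyond the 16ns RAM cycle for address generation, wiring delay, and clock skew.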

6.2  ROM 

The internal ROM is 512 words by 24 bits. The core is organized as 64 rows and 192
columns. The cycle time for the ROM is less than 16ns (4.75V, 85°C). The ROM address
space overlaps the program RAM; while the system is booting the program RAM data
drivers are disabled. The ROM artwork was logic simulated to verify the bit programming.
The ROM area is 0.84mm².

6.3  Analog  Blocks 

The  OSC  is  an  internal  ring  oscillator  which  minimizes  component  count  for  lower  cost 
systems.  The  oscillator  drives  the  system  cycle  generator  when  selected.  An  inverter  feed- 
back ring  was  chosen  for  simplicity.  To  reduce  the  frequency  variability,  the  ring  feedback 
is  adjustable  via  programmable  clocked-inverter  taps  decoded  from  three  dedicated  pins. 
The  frequency  variability  is  reduced  to  36%  over  temperature  and  17%  over  voltage  over  a 
tunable range of 30MHz to 80MHz. A Schmitt trigger ring driver clocks a toggle flip-flop
to ensure a 50% duty cycle. In Test mode the oscillator is observable via a serial port. The
oscillator  resides  in  the  pad  ring  to  isolate  it  from  the  digital  environment. 

The DAC was selected from HP’s customizable analog cell library available in HP-Cmos34.
It is based on an 8-bit poly-resistor string design. Of note are CMOS transmission
gates used to make the resistor endpoints extendible to VDD and GND. The output swings
between  these  voltage  references  which  are  sourced  off-chip. 

The OPAMP, a general purpose opamp with a two-stage input and class AB output,
is used as a voltage follower to buffer the high-impedance DAC output. The
opamp can swing rail-to-rail while driving a 3K resistive and/or 200pF capacitive load.
An external compensation capacitor allows processing in Cmos34 without an extra mask
required for linear capacitors.


7 Test  Methodology 

A 50MHz  data  rate  speed  goal  made  the  Schlumberger  S50  the  local  tester  of  choice. 
The  customer  contracted  with  TSSI  (Beaverton, OR)  for  their  software  test  development 
system  (TDS)  which  converts  captured  simulation  vectors  to  test  vectors.  TDS  generates 
S50  MDC  (patterns),  TEG  (timing),  and  pingroups  directly.  A pattern  bridge  (PBridge) 
essentially samples the simulation responses, checking and formatting for S50 constraints.
More  than  900K  vectors  have  been  generated. 


8 Results

First  silicon  was  largely  functional,  with  a major  exception  being  the  corruption  of  one  of 
the  processor  addressing  modes.  Root  cause  was  traced  to  a logic  inversion  in  a Verilog 
model  for  a multiplexer.  As  a result,  first  silicon  could  not  boot  from  ROM  and  hence  run 
the  redundancy  code  for  self-test  and  configuration. 

Second silicon was a quick metal1/via/metal2 turn to correct the addressing mode,
and the silicon was fully functional for software development and system operation up to
20 MHz.

Third silicon was a full mask turn to increase the performance of the part. Unfortunately,
some of the edits introduced contention on the processor address
bus, limiting performance. Again, a quick turn is in the offing to solve the contention and
improve the performance.

The  132  pin  CPGA  package  can  be  fitted  with  a heatsink  to  allow  operating  the  chip 
above  20MHz. 

Investigations into porting the design into HP-Cmos26 are underway. The standard
libraries  are  well-suited  for  50MHz  system  operation,  and  the  reduced  silicon  area  will 
translate  directly  into  a lower  cost  part  and  larger  packaging  offerings. 

9 Conclusions

A large digital signal processing IC has been fabricated in HP-Cmos34. Routing processes
have been improved, and the standard cell offering enhanced with additional cells.
More  accurate  four-parameter  timing  models  have  been  developed  for  Verilog  and  other 
industry  simulators.  New  software  was  applied  in  the  generation  of  a large  set  of  test 
vectors.  Sharing  the  design  with  the  customer  was  largely  successful  without  major  show- 
stoppers  resulting  in  beta-site  quality  systems  on  schedule.  Efforts  to  port  the  design  into 
HP-Cmos26  are  underway  promising  higher  performance  and  more  competitive  systems. 
Die Size           13.7mm x 14.1mm
Routed Cells       56K gates
Custom RAM         48K bits
Custom ROM         12K bits
Total FETs         540,000
Package            600mil 132-CPGA
Power Supply       5.0V +/- 10%
Operating Power    2.5W (40MHz)

Table 3: Chip Characteristics and Photomicrograph
J. Chen

NASA  Space  Engineering  Research  Center  for  VLSI  System  Design 

University  of  Idaho 
Abstract- It has previously been shown that the Reed Solomon (RS) codes can
correct errors beyond the Singleton and Rieger Bounds with arbitrarily small
probability of a miscorrect [1]. That is, an (n,k) RS code can correct more than
(n-k)/2 errors. An implementation of such an RS decoder is presented in this
paper. An existing RS decoder, the AHA4010, is utilized in this work. This
decoder is especially useful for errors which are patterned with a long burst
plus some random errors.


1 Introduction 

It  is  well  known  that  an  (n,k)  RS  code  can  correct  up  to  (n-k)/2  random  errors.  When 
burst  errors  are  involved,  the  error  correcting  ability  of  the  RS  code  can  be  increased 
beyond  (n-k)/2  with  arbitrarily  small  probability  of  a miscorrect  [1].  Errors  considered  in 
this  paper,  called  composite  errors,  have  a single  burst  plus  random  error  pattern. 

RS  codes  are  powerful  error  correcting  codes.  There  is  a rich  history  of  work  developing 
decoding  algorithms  for  RS  codes.  Virtually  all  of  the  work  focuses  on  the  general  case 
of t unknown error locations. It is possible to extend the error correction capability of an
RS code if error location information is available from some external source. This is called
erasure  decoding. 

The extended decoding technique presented in this paper assumes that the locations
of the burst are known and treats them as erasures. All possible burst error positions are
given to the decoder sequentially as “guesses” to the burst error location. That is, the
burst part of the error becomes an erasure and an erasure-locator polynomial is generated
from the erasure locations for each burst location guess. By sending this erasure-locator
polynomial along with a received code word to a general purpose RS decoder, such as
the AHA4010, the RS decoder will decode the received codeword. The result output by the
RS decoder is either corrected data or a signal which indicates that no correction can be
made.

The  erasure-locator  polynomial  is  generated  iteratively  for  all  possible  locations  during 
the  decoding  procedure.  It  is  possible  that  more  than  one  error  polynomial  results  from 
this  iterative  procedure.  When  more  than  one  error  is  obtained,  the  error  that  has  higher 
probability  of  occurrence  should  be  chosen.  It  is  assumed  in  this  paper  that  an  error  with 
smaller  weight  has  higher  probability  of  occurrence.  This  is  true  for  most  channels. 

If  the  chosen  error  is  not  the  true  error,  a miscorrect  occurs.  The  probability  of  mis- 
correct  is  a function  of  the  size  of  the  error  that  is  detected  and  the  channel  statistics.  It 
is  usually  very  low  as  shown  in  reference  1. 

The  implementation  presented  in  this  paper  is  based  on  the  AHA4010  RS  decoder. 
The purpose is to increase the error correction capability with very little increase in
hardware and software.

2 Standard  Decoding  Description 

The  standard  procedure  for  decoding  the  RS  code  is  summarized  below: 

STEP 1: Compute syndromes $S_j = v(\alpha^{j+j_0-1})$ for $j = 1, 2, \ldots, 2t$.

STEP 2: From the syndromes, form the error-location polynomial $\Lambda(x)$, where
$\Lambda(x) = (1 - xX_1)(1 - xX_2) \cdots (1 - xX_t)$ and $X_1, X_2, \ldots, X_t$ are
the error locations.

STEP 3: Find error locations $X_j$ $(j = 1, \ldots, t)$ by finding zeros of $\Lambda(x)$.

STEP 4: Find error magnitudes $Y_j$ $(j = 1, \ldots, t)$ by calculating the first t syndrome
equations.

STEP 5: Correct the error.

Two  polynomials  are  needed  during  the  decoding  and  they  are: 

$$S(x) = \sum_{j=1}^{2t} S_j x^{j-1} \qquad (1)$$

and

$$\Omega(x) = S(x)\Lambda(x) \pmod{x^{2t}} \qquad (2)$$

This  second  equation  is  commonly  known  as  the  Key  Equation,  because  solving  it  is 
the  key  to  decoding  the  RS  code.  After  obtaining  the  error  locations,  the  error  magnitudes 
can  be  found  as: 


$$Y_j = \frac{X_j^{1-j_0}\,\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})} \qquad (3)$$

For $j_0 = 1$,

$$Y_j = \frac{\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})} \qquad (4)$$

It is now clear that the decoding procedure becomes one of finding the $\Lambda$ and $\Omega$
polynomials from S(x), and then finding the location and magnitude of the errors from those
two polynomials.

When  erasures  are  involved,  an  erasure-locator  polynomial  is  created. 


$$\Gamma(x) = \prod_{p} (1 - xX_p)$$

where the $X_p$'s are the erasure locations.
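The erasure-locator polynomial can be built up one factor at a time. The sketch below is illustrative only: it assumes GF(2^8) generated by the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), which is a common choice and not necessarily the one used by the AHA4010, and it represents each erasure location X_p as a power of the primitive element alpha. In a field of characteristic 2, subtraction and addition coincide, so (1 - xX_p) is computed with XOR.

```python
PRIM = 0x11D  # assumed field polynomial; the actual decoder's polynomial may differ

# Exponential and logarithm tables for GF(2^8).
EXP = [0] * 510
LOG = [0] * 256
_value = 1
for _power in range(255):
    EXP[_power] = _value
    LOG[_value] = _power
    _value <<= 1
    if _value & 0x100:
        _value ^= PRIM
for _power in range(255, 510):
    EXP[_power] = EXP[_power - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest-degree coefficient first) by Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = gf_mul(result, x) ^ c
    return result


def erasure_locator(positions):
    """Return Gamma(x) = prod_p (1 - x*X_p) with X_p = alpha**position."""
    gamma = [1]
    for pos in positions:
        x_p = EXP[pos % 255]
        shifted = [0] + [gf_mul(c, x_p) for c in gamma]        # x * X_p * Gamma(x)
        gamma = [g ^ s for g, s in zip(gamma + [0], shifted)]  # Gamma + x*X_p*Gamma
    return gamma


if __name__ == "__main__":
    gamma = erasure_locator([3, 4, 5, 6, 7])   # a burst of five erased locations
    print(gamma)
    # Gamma vanishes at the inverse of every erasure location.
    print([poly_eval(gamma, EXP[(255 - p) % 255]) for p in [3, 4, 5, 6, 7]])
```

The roots of the resulting polynomial are the inverses of the erasure locations, which is the property a decoder relies on when it folds known locations into the Key Equation.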

The Key Equation can be solved for $\Lambda$ and $\Omega$ in several ways. One of them is Euclid’s
recursive algorithm, which is briefly described below. First let

$$\Omega^{(-1)}(x) = x^{2t}$$

$$\Omega^{(0)}(x) = S(x)\Gamma(x) \pmod{x^{2t}}$$

$$\Lambda^{(-1)}(x) = 0$$

$$\Lambda^{(0)}(x) = \Gamma(x)$$

the recursive equations are

$$\Omega^{(i)}(x) = R_{\Omega^{(i-1)}(x)}\left[\Omega^{(i-2)}(x)\right] \qquad (5)$$

or equivalently,

$$\Omega^{(i-2)}(x) = q^{(i)}(x)\,\Omega^{(i-1)}(x) + \Omega^{(i)}(x) \qquad (6)$$

and

$$\Lambda^{(i)}(x) = q^{(i)}(x)\,\Lambda^{(i-1)}(x) + \Lambda^{(i-2)}(x) \qquad (7)$$

The recursion is continued until the degree of $\Omega$ is less than $t + p/2$, where p is the
number of erasures.

Erasures  are  the  errors  which  have  been  located  prior  to  decoding.  Utilizing  this  infor- 
mation will  improve  the  error  correction  capability  of  the  decoder.  Since  the  burst  is  a big 
part  of  a composite  error,  a burst  erasure  will  make  the  error  correction  capability  much 
greater.  This  idea  leads  to  the  following  approach: 


STEP 1 Set stop conditions, the maximum iteration time N and n=0.

STEP 2 Assume the burst begins at location a and n=n+1.

STEP 3 Decode the error with the burst as erasures.

STEP 4 If the result satisfies the stop conditions or n≥N, go to STEP 5. Else, increase
the beginning location of the burst, go to STEP 2.

STEP 5 Report the result.

Figure 1: Block Diagram

In  other  words,  the  decoding  method,  used  by  the  extended  decoder,  is  to  guess  where 
the  burst  part  of  the  error  is  and  try  to  decode  it. 
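The guess-and-decode loop can be sketched in a few lines. This is a minimal illustration, not the hardware design: `decode_with_erasures` stands in for a general purpose errors-and-erasures RS decoder, and `make_toy_decoder` is a hypothetical stand-in that "succeeds" whenever at most t errors remain outside the erased window. The toy decoder consults the true codeword purely so the loop can be exercised; a real decoder has no such knowledge. Wrapping the burst window around the end of the word is also an assumption made here.

```python
def extended_decode(received, burst_len, decode_with_erasures, max_iter, step=1):
    """Try successive burst-start guesses until the inner decoder succeeds.

    Returns (burst_start, decoded) on success, or None if max_iter guesses fail.
    """
    n = len(received)
    start = 0
    for _ in range(max_iter):                                   # bounded iteration count
        erasures = [(start + j) % n for j in range(burst_len)]  # guess the burst window
        decoded = decode_with_erasures(received, erasures)      # decode with erasures
        if decoded is not None:                                 # stop condition met
            return start, decoded
        start = (start + step) % n                              # move the guess along
    return None


def make_toy_decoder(codeword, t):
    """Hypothetical stand-in for an RS decoder that corrects up to t random errors."""
    def decode(received, erasures):
        erased = set(erasures)
        residual = sum(
            1
            for i, (r, c) in enumerate(zip(received, codeword))
            if r != c and i not in erased
        )
        return list(codeword) if residual <= t else None
    return decode


if __name__ == "__main__":
    codeword = [0] * 20
    received = list(codeword)
    for i in (7, 8, 9, 10, 11, 15):      # burst at 7..11 plus one random error
        received[i] = 1
    decoder = make_toy_decoder(codeword, t=1)
    print(extended_decode(received, 5, decoder, max_iter=20))
```

With a burst of five symbols and one random error, only the guess that covers the whole burst leaves few enough residual errors for the inner decoder, so the loop stops at that guess.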

3 Extended  Decoder  Design 

The extended RS decoder has an AHA4010 decoder at its center. An erasure-locator polynomial
generator, an error choice unit and a data buffer are attached to the AHA4010
decoder. The top level block diagram of this extended decoder is shown in Figure 1.

The erasure-locator polynomial generator generates $\Gamma(x)$. $\Gamma(x)$ could be generated for
every possible error location. However, this may not be necessary. For example, let error,
e(x), be defined as:

$$e(x) = \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + \alpha^4 x^{13} \qquad (8)$$

The error, e(x), can be interpreted as

1. $e(x) = 0x^{-1} + \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + \alpha^4 x^{13}$

A burst length of 5 ($0x^{-1} + \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3$) and one random error ($\alpha^4 x^{13}$).

2. $e(x) = 0x^{-2} + 0x^{-1} + \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + \alpha^4 x^{13}$

A burst length of 5 ($0x^{-2} + 0x^{-1} + \alpha^6 + \alpha^9 x + \alpha^6 x^2$) and two random errors
($\alpha^0 x^3$, $\alpha^4 x^{13}$).

3. $e(x) = \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + 0x^4 + 0x^5 + \alpha^4 x^{13}$

A burst length of 5 ($\alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + 0x^4$) and one random error ($0x^5$, $\alpha^4 x^{13}$).

4. $e(x) = 0x^{-1} + \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + 0x^4 + 0x^5 + \alpha^4 x^{13}$

A burst length of 5 ($0x^{-1} + \alpha^6 + \alpha^9 x + \alpha^6 x^2 + \alpha^0 x^3 + 0x^4$) and two random errors
($0x^5$, $\alpha^4 x^{13}$).

Figure 2: Erasure-locator Polynomial Generator

Figure 3: Error Choice Unit
CONTROLS:
T1 = P0 + P1*CORRECT*(C [illegible] 1),
T2 = P1*CORRECT*(C = 1) + P2*(CA > CB),
T3 = P0 + P1*CORRECT*(C = 1) + P2*(CA > CB),
T4 = P0 + P1*CORRECT*([illegible]).

An RS code with the ability of correcting a burst of length 5 and 2 random errors
will correct all the errors above. Using this logic, $\Gamma(x)$ can be generated every m error
location bits. The user must decide the value of m under the consideration of the number
of iteration times and the size of the correctable error.

Meanwhile, the error choice unit stores the data corrected by the AHA4010 decoder
and converts it back to the error polynomial. If the size of the error is less than t’ (i.e.,
this error has the highest probability of occurrence), the error choice unit interrupts the
iteration and outputs the corrected data. Otherwise the iteration continues. If more than
one error is found, the error choice unit compares these errors and the smallest error is
chosen (it is assumed that the smallest error has the highest probability of occurrence).


4 Erasure-Locator  Polynomial  Generator 


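Assume the received code words have a composite error patterned with t random errors
and one burst error of length v. The burst locations may be $\alpha^{r+1}, \alpha^{r+2}, \ldots, \alpha^{r+v}$, where r
is from 0 to 255. The erasure-locator polynomial, $\Gamma(x)$, has the form:

$$\begin{aligned}
\Gamma(x) &= \prod_{j=1}^{v} (1 + x\alpha^{j+r}) \\
&= \prod_{j=1}^{v} (\alpha^{-r} + x\alpha^{j})\,\alpha^{r} \\
&= \alpha^{vr}\,(\Gamma_1 x^{v} + \Gamma_2 x^{v-1} + \cdots + \Gamma_v x + \alpha^{-vr})
\end{aligned}$$

where $\Gamma_1, \Gamma_2, \ldots$ and $\Gamma_v$ are constant and r is from 1 to 255.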

For each received code word, the corresponding decoding process is performed N/m
times with N/m different $\Gamma(x)$, where N is the length of the RS code and m is the
number of bits that $\Gamma(x)$ skips. At the end of each decoding process, a DONE signal is
sent to the erasure-locator polynomial generator. The DONE signal causes the erasures to
shift to the right by m bits, so a new $\Gamma(x)$ is generated. This operation repeats until
a FOUND signal is received or r > 255.

The erasure-locator polynomial generator is depicted in Figure 2. The coefficients of
this polynomial, $\Gamma_j\alpha^{jr}$ (j from 0 to v), are not constant. Each $\Gamma_j\alpha^{jr}$ is multiplied by a
power of $\alpha$ whenever INCREASE CONTROL (i.e. the DONE signal) is asserted.

The operations can be described in a register transfer language where each $P_i$ is a
control state that defines the data transfers that take place when $P_i$ is active. A register
transfer language description for the erasure-locator polynomial generator is shown below:

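• $P_0$ : r = 0; if GO = 1, then go to $P_1$.

• $P_1$ : if FOUND = 1 or r = 255, then go to $P_0$; else $\Gamma_0 = \alpha^{r}$, $\Gamma_1 = \Gamma_1\alpha^{pr}$,
$\Gamma_2 = \Gamma_2\alpha^{pr}, \ldots, \Gamma_p = \Gamma_p\alpha^{pr}$ and r = r + 1.

• $P_2$ : $\Gamma(x) = \prod(1 - x\alpha^{r})$; if DONE = 1, go to $P_1$.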
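The behaviour of the generator can be illustrated in software. The following Python is a
minimal sketch, not the circuit of Figure 2. It assumes GF(2^8) built from
x^8 + x^7 + x^2 + x + 1 (an assumed choice of primitive polynomial), and the names
`build_tables`, `erasure_locator` and `shift_locator` are hypothetical. It forms
$\Gamma(x) = \prod_{j=1}^{v}(1 + x\alpha^{j+r})$ and shows that advancing r by m is equivalent to
multiplying the coefficient of $x^k$ by $\alpha^{km}$.

```python
PRIM_POLY = 0x187  # assumed: x^8 + x^7 + x^2 + x + 1


def build_tables(prim=PRIM_POLY):
    """Return antilog (EXP) and log (LOG) tables for GF(2^8)."""
    exp = [0] * 510
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= prim
    for i in range(255, 510):  # duplicate so log sums need no modulo
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def erasure_locator(r, v):
    """Coefficients (lowest degree first) of prod_{j=1..v} (1 + x*alpha^(j+r))."""
    poly = [1]
    for j in range(1, v + 1):
        root = EXP[(j + r) % 255]
        nxt = poly + [0]
        for k, c in enumerate(poly):
            nxt[k + 1] ^= gf_mul(c, root)  # addition in GF(2^8) is XOR
        poly = nxt
    return poly


def shift_locator(poly, m):
    """Move the erasure window by m positions: coefficient k is scaled by alpha^(k*m)."""
    return [gf_mul(c, EXP[(k * m) % 255]) for k, c in enumerate(poly)]
```

With v = 8 and m = 4, `shift_locator(erasure_locator(0, 8), 4)` gives the same
coefficients as `erasure_locator(4, 8)`, so only coefficient scaling is needed between
iterations.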

5 Error  Choice  Unit 

During the decoding iteration, it is possible that more than one error results. The error
with the highest probability of occurrence should be chosen. It is assumed that this will
be the smallest error. The diagram of the error choice unit is shown in Figure 3.

The first data corrected by the AHA4010 decoder is stored in register A; its corresponding
error is also calculated, and the size of the error is stored in CA. If the size of the error
is less than t', the CMP asserts the FOUND signal and outputs the data in register A;
otherwise the decoding process continues. The second corrected data is stored in register
B, and the size of the second error is stored in CB. The CMP compares the values of CA and
CB. If CA > CB, A is replaced by B and CA is replaced by CB. If the value of CB is less
than t', the CMP asserts the FOUND signal and outputs the data in register A. If CA <
CB, nothing changes. This comparison is performed every time corrected data is output
from the AHA4010 decoder. It guarantees that register A always holds the data
corrected from the smallest error.

A signal  from  the  erasure-locator  polynomial  generator  tells  the  error  choice  unit  that 
the  iteration  is  finished.  The  data  in  register  A is  the  output. 

A register  transfer  language  description  for  the  error  choice  unit  is: 

• $P_0$ : 0 → A, 0 → B, 1 → C, FFH → CB; if GO = 1, go to $P_1$.

• $P_1$ : if CA < t' or CB < t' or FLAG = 1 (i.e. r = 255), output data, set FOUND = 1,
go to $P_0$.

• if CORRECT = 1 & C = 1, correctedData → A, size(correctedData) → CA, C = C + 1;

• if CORRECT = 1 & C ≠ 1, correctedData → B, size(correctedData) → CB;

• $P_2$ : if CA > CB, B → A, CB → CA; go to $P_1$.

CORRECT  is  a signal  from  the  AHA4010  decoder  which  indicates  a correction  has 
or  has  not  been  made.  C is  a counter.  It  counts  the  number  of  correction  times  for  one 
received  code  word. 
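
The selection rule of the error choice unit can be sketched in a few lines of Python. This
is an illustrative model only, assuming each decoding pass yields a pair of corrected data
and error size; the function name `choose_correction` and its return format are
hypothetical and are not part of the AHA4010 interface.

```python
def choose_correction(candidates, t_prime):
    """Pick the corrected data with the smallest error size.

    candidates: iterable of (corrected_data, error_size) in decoder output order.
    Returns (data, size, found_early); found_early mirrors the FOUND signal being
    asserted because the retained error size dropped below t_prime.
    """
    best_data, best_size = None, None
    for data, size in candidates:
        # Register A / CA keep the smallest error seen so far (replace only if CA > CB).
        if best_size is None or size < best_size:
            best_data, best_size = data, size
        if best_size < t_prime:
            return best_data, best_size, True
    # Iteration finished without an early FOUND: output whatever register A holds.
    return best_data, best_size, False
```

For example, with t' = 11 the candidate sizes 15, 12, 13 run to completion and the size-12
result is kept, while the sizes 15, 9, 3 stop at the second pass.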
6 An Example

Consider a (255,235) RS code over GF($2^8$) defined by the primitive polynomial p(x) =
$x^8 + x^7 + + x^1 + 1$ with the primitive element $\alpha = x$. This code can normally correct
ten random errors. Assume the received errors consist of a burst of length 8 and 5 random
errors. After considering the number of iterations and the size of the correctable error,
we set m = 4 and t' = 11.


SOLUTION: 

The received polynomial is:

$$v(x) = x^{14} + \alpha^{3}x^{15} + \alpha^{200}x^{16} + \alpha^{8}x^{17} + \alpha^{40}x^{18} + \alpha^{23}x^{19} + \alpha^{6}x^{20} + x^{21} + \alpha^{54}x^{183} + \alpha^{71}x^{198} + \alpha x^{233}. \quad (9)$$

When the extended RS decoder is turned on, the erasure-locator polynomial is:

$$\Gamma(x) = \prod_{i=1}^{8} (1 + x\alpha^{i}). \quad (10)$$

This $\Gamma(x)$ is sent to the AHA4010 decoder; the FOUND signal is zero. Multiply the
coefficients of $\Gamma(x)$ by $\alpha^{32}$ (i.e. $\alpha^{vr} = \alpha^{4 \cdot 8} = \alpha^{32}$). The erasure-locator polynomial becomes:

$$\Gamma(x) = \prod_{j=1}^{8} (1 + x\alpha^{j}\alpha^{4})$$

and this new $\Gamma(x)$ is sent to the AHA4010; the FOUND signal is still zero. This decoding
process is repeated until the FOUND signal is one. That gives the corrected data:

{0, 0, 0, ..., 0}

The  corresponding  erasure-locator  polynomial  is: 

T{x)=Jl{l  + xa^ali)  (11) 

i 

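and the corresponding error polynomial is:

$$e(x) = x^{14} + \alpha^{3}x^{15} + \alpha^{200}x^{16} + \alpha^{8}x^{17} + \alpha^{40}x^{18} + \alpha^{23}x^{19} + \alpha^{6}x^{20} + x^{21} + \alpha^{54}x^{183} + \alpha^{71}x^{198} + \alpha x^{233}$$
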


7 Summary 

An extended RS decoder has been presented in this paper. With two extra circuits, the
error correction capability of a general purpose RS decoder can be increased. This design
shows a way to improve the error correction capability of existing RS decoders.
Abstract- Speed requirements have been, and will continue to be, a major consideration
in the design of hardware to implement digital signal processing functions such as
digital filters and transforms such as the DFT and DCT. The conventional approach is
to increase speed by adding hardware and increasing chip area. The real challenge is to
save chip area while still maintaining high speed performance. The approach we propose
is based on the distributed arithmetic (DA) implementation of digital filters. The
improvement is based on two observations. Firstly, a single memory element can replace
several identical memory elements in a fully parallel DA implementation. Secondly,
truncation or rounding may be introduced into the computation at strategic points
without increasing error unduly. Both of these approaches can be used to attain area
savings without impairing speed of operation.

1 Introduction 

Finding the inner product between two vectors is an operation that commonly arises in
signal processing as well as in general data processing. Digital convolution and correlation
are directly described as inner products. Other operations such as the discrete Fourier
transform and other common transforms can be implemented as a sequence of inner products.
Consider the inner product

$$y = \sum_{k=1}^{K} A_k x_k \quad (1)$$

In the case of an FIR digital filter, the $A_k$ represent a set of fixed weights, and the $x_k$
represent the current and past K - 1 filter inputs. The inner product can be implemented
directly by using a single multiplier and an accumulator in a serial, one-product-at-a-time
manner, as in Figure 1, or in a fully parallel manner by using K multipliers and a
multi-input adder or adder tree, as in Figure 2. Obviously, the fully parallel architecture
will always be faster than the serial approach.

The distributed arithmetic (DA) approach to computing the inner product was developed
in the early seventies [1,2,3,4,5,6,7,8]. In this approach, combinations of the $A_k$ are
precomputed and stored in memory. Input data are used to identify which memory words
are to be fetched, shifted and added to produce the final result. Without loss of generality,
first assume that the $x_k$ are scaled such that $|x_k| < 1$. In two's complement form

$$x_k = -b_{k0} + \sum_{m=1}^{M-1} b_{km} 2^{-m} \quad (2)$$

where the $b_{km}$ represent the individual bits in $x_k$, with $b_{k0}$ the sign bit. Substituting (2)
into (1) and rearranging the order of summation gives

$$y = \sum_{m=1}^{M-1} \left( \sum_{k=1}^{K} A_k b_{km} \right) 2^{-m} - \sum_{k=1}^{K} A_k b_{k0} \quad (3)$$

Since the bits $b_{km}$ are either 0 or 1, the term $\sum_{k} A_k b_{km}$ can be precomputed for all
$2^K$ possible combinations of $b_{km}$. These values are then stored in a ROM or RAM. The
actual combinations of $b_{km}$, arising out of the input data, are used to address one of the
precomputed terms stored in the memory. Note that these combinations are formed by
selecting the mth bit from each of the K M-bit input words. The mth term so addressed
is then shifted by m bits to the right before being added to the other terms. The only
exception to this is when m = 0. In this case, which corresponds to using the sign bits of
the input data to form the address, the addressed term is subtracted from the other terms.
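
Equation (3) can be checked with a short software model. The Python below is a minimal
sketch of the table lookup, assuming exact rational arithmetic in place of fixed-width
hardware words; the function names `to_bits`, `build_lut` and `da_inner_product` are
hypothetical.

```python
from fractions import Fraction


def to_bits(x, M):
    """Two's-complement bits [b_0, ..., b_(M-1)] of x, where -1 <= x < 1."""
    scaled = Fraction(x) * 2 ** (M - 1)
    assert scaled.denominator == 1, "x must be a multiple of 2^-(M-1)"
    assert -2 ** (M - 1) <= scaled < 2 ** (M - 1)
    n = int(scaled) % (2 ** M)  # wrap negatives into M-bit two's complement
    return [(n >> (M - 1 - m)) & 1 for m in range(M)]


def build_lut(A):
    """Precompute sum_k A_k * b_k for all 2^K bit combinations (bit k of the address)."""
    K = len(A)
    return [sum(A[k] for k in range(K) if (addr >> k) & 1) for addr in range(2 ** K)]


def da_inner_product(A, xs, M):
    """Evaluate y = sum_k A_k x_k by distributed arithmetic, following equation (3)."""
    lut = build_lut(A)
    bits = [to_bits(x, M) for x in xs]
    y = Fraction(0)
    for m in range(M):
        # Address formed from bit m of every input word.
        addr = sum(bits[k][m] << k for k in range(len(xs)))
        term = lut[addr] * Fraction(1, 2 ** m)  # shift right by m bits
        y += -term if m == 0 else term          # the sign-bit term is subtracted
    return y
```

For any coefficients and any inputs representable in M bits, the result equals the direct
sum of products, which is what makes the lookup table a drop-in replacement for the
multipliers.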

As with the direct implementation of the inner product, there are two approaches to
implementing DA. The inner product can be computed by using only one memory and a
single accumulator as shown in Figure 3, or in a fully parallel manner by using M memories,
each with identical contents, as shown in Figure 4. Again, the fully parallel approach will
always be the fastest. In Figure 4, we note that the shifting is actually accomplished by
connecting the memory outputs to appropriate positions on a multi-input adder.

Figure 3: Single Memory DA Implementation

Comparing Figures 4 and 2 is instructive. We note that where the input data words
are M bits wide, M memories are always used in the DA implementation, independent
of K, the number of multiplies in the inner product. However, each memory must store
$2^K$ terms. So, increasing K will increase the required size of the memories. Also, as K
increases, the number of stored bits per term must increase in order to maintain accuracy.
The direct implementation, by comparison, uses K multipliers. As M increases, the width
and depth¹ of the multipliers must increase to preserve accuracy. Thus, depending on
the word size, number of products, and required accuracy, one approach may have size
advantages over the other.

In terms of speed, DA does have one clear advantage over the direct implementation.
Increasing the accuracy of the inner product by increasing the number of bits in the
input data words and in the coefficients will not degrade the speed performance of a DA
implementation. The number of memories and the width of each will increase, but the
number of stored terms in each memory will not. In a direct implementation, however,
not only will the width of the multipliers increase, but so will their depth, resulting in
slower performance. Increasing K does not decrease the speed of the multipliers, but it
will increase the depth of the adder tree in the direct implementation, resulting in some
loss of performance. In a DA implementation, increasing K will slow down the memories,
but it does not increase the depth of the adder tree.

While the structure of the fully parallel DA implementation is very regular and hence
attractive for VLSI implementation, it appears to be very inefficient in terms of its use
of space. That is, for each inner product computed, only one of the $2^K$ terms stored in
each memory is used. Further, the contents of the M memories are identical. Our
first observation about how the fully parallel DA architecture may be improved involves
replacing the M memories with just one memory unit that provides M data access paths.

¹This assumes that the width of the coefficients also increases in proportion to M.

Figure 4: Fully Parallel DA Implementation
2 An  Improved  DA  Architecture 

A ROM, using one transistor per storage bit, is shown in Figure 5. The stored bits are
zero or one depending on whether the drain of the associated transistor is connected to
the data line or not. Note that the address decoder and the data line sense amplifiers are
not shown. Not counting these components, the number of transistors required for the
memories in a ROM based fully parallel DA implementation is

$$n_t = M(b2^K + 2^K + b) \quad (4)$$

where b represents the number of bits stored in each word of the memory. Next, consider
Figure 6, which represents one plane of an M-way multi-access memory. Each plane stores
one word, and $2^K$ planes together make up the complete memory unit as shown in Figure
7. Each plane has M sets of b control transistors that are used to route the stored word to
the appropriate output register. Each set of b control transistors is controlled by a single
control line. The data bits associated with control line m in each plane are connected to
a bus which connects with output register m. Which of the $2^K$ control lines is asserted
is determined by address decoder m. Since this circuit effectively addresses the output
registers instead of the stored words, there is no need for address lines for the stored words
themselves. Further, a transistor is not required for each stored bit. A zero is stored simply
with a shorted line, a one with an open. The control transistors assume the function of the
storage transistors in the ROM architecture, and provide a path between every stored word
in the memory unit and every output register. The total number of transistors required to
implement the memory unit, again excluding address decoders and output registers, is

$$n_t = Mb2^K + b2^K + 2^K \quad (5)$$

Figure 5: ROM Architecture

Note that both approaches use M K-to-$2^K$ decoders and identically sized adder trees. The
ratio of the number of transistors in the storage sections of the multi-access memory unit
and the memories in a fully parallel DA architecture gives an estimate of the area savings
potential presented by one approach over the other. Dividing (5) by (4) gives

$$R_{area} = \frac{Mb2^K + b2^K + 2^K}{Mb2^K + M2^K + Mb} = \frac{Mb2^K + (b+1)2^K}{Mb(2^K + 1) + M2^K} \quad (6)$$

Since b will usually be greater than M, $R_{area} > 1$. That is, the multi-access memory
architecture presents no area savings, despite the fact that it replaces M copies of each
stored word with just one copy. This is because the expense of bus controls erases the
savings in memory transistors. In fact, the number of transistors associated with stored
bits in the fully parallel implementation is $Mb2^K$. This is also the number of bus control
transistors in the multi-access memory. However, for memory architectures where there is
more than one transistor per stored bit, the savings in storage transistors will not be
absorbed by bus control transistors. To see this, consider the case in which 4-transistor
static RAM cells are used as memory elements. Static RAM may be required in cases where
the inner product is to be configurable, in the sense that the coefficients may be changed
from time to time, requiring the memory contents to be rewritten. The conventional static
RAM architecture is shown in Figure 8 and one plane of the multi-access memory
architecture is shown in Figure 9.

Figure 6: Memory Unit Plane

A fully parallel implementation of DA using M static RAMs of the type shown in
Figure 8 would use $4Mb2^K$ storage transistors and $Mb2^K$ cell select transistors. The static
RAM multi-access memory unit would use $4b2^K$ transistors for storage and $Mb2^K$ bus
control transistors. Thus, when static RAM cells are used,

$$R_{area} = \frac{(M + 4)b2^K}{5Mb2^K} = \frac{M + 4}{5M} \quad (7)$$

Here there will be an area savings so long as M (the number of bits in the input data words
and the number of memory units in the fully parallel DA implementation) is greater than 1.
If M = 8, $R_{area}$ = 0.30. Thus the area savings can be significant. These observations have
been made in the context that the number of transistors corresponds to area requirements.
The same results hold whenever the storage cell (be it transistor based or not) requires
more area to implement than do bus control transistors. Again, it must be noted that while
the area required for address decoders and the adder tree is the same for both
implementations, these requirements are not included in the $R_{area}$ computation. So,
$R_{area}$ only reflects the savings potential in the storage section of the implementation. To
the degree that the storage section dominates the other elements of the implementation this
may translate into significant savings. Not only do the other elements need to be included
in the computations, but an actual VLSI layout of a multi-access memory based DA circuit
needs to be attempted to make sure that the connection complexity of the multi-access
memory unit does not overwhelm what appears to be a significant area savings potential
in the case of static RAM memory cells.
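
The transistor counts above are easy to tabulate for particular parameters. The Python
below is a minimal sketch that simply evaluates equations (4), (5) and the static RAM
ratio; the function names are hypothetical, and decoders, sense amplifiers and adder trees
are excluded, as in the text.

```python
def rom_parallel(M, b, K):
    """Equation (4): storage transistors in M separate one-transistor-per-bit ROMs."""
    return M * (b * 2 ** K + 2 ** K + b)


def rom_multi_access(M, b, K):
    """Equation (5): transistors in one M-way multi-access ROM unit."""
    return M * b * 2 ** K + b * 2 ** K + 2 ** K


def sram_parallel(M, b, K):
    """M static RAMs: 4 storage transistors plus 1 cell select transistor per bit."""
    return 5 * M * b * 2 ** K


def sram_multi_access(M, b, K):
    """Multi-access static RAM: 4*b*2^K storage plus M*b*2^K bus control transistors."""
    return (M + 4) * b * 2 ** K


def sram_ratio(M):
    """Ratio of multi-access to fully parallel static RAM transistor counts."""
    return (M + 4) / (5 * M)
```

As an illustration with assumed values M = 8, b = 16 and K = 4, the ROM counts are 2320
for the multi-access unit against 2304 for the parallel ROMs (a ratio slightly above 1),
while the static RAM ratio is 0.30.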


3 Truncation  and  Rounding 

Once the terms are fetched from the memory, they are shifted and added to form the final
result. This operation is diagrammed in Figure 10. If each memory in a fully parallel
implementation stores terms that are b bits wide, then the resulting inner product will
occupy at most M + b bits.² Suppose, however, that the product only needs to be determined
to an accuracy of f significant bits. In this case it may be possible to truncate or round the
individual terms before adding. Doing so would not only reduce the amount of hardware
required in the adder tree, but would also reduce the size of some of the memories. This
is obvious in the case of the fully parallel DA implementation and, as we will see later, it
is also true for the multi-access memory unit based implementation. First let us consider
what the impact of truncating or rounding will be on the accuracy of the final result.

Figure 8: Static RAM

Truncating the individual terms and discarding bits that fall in column f + e and to
the right (as shown in Figure 10) will give a maximum worst case error of

$$E_{tt} = 2^{-e}\left((M + b - (f + e)) - 1 + 2^{-(M + b - (f + e))}\right) \quad (8)$$

where we have normalized the result so that the binary point falls just to the left of column
f.³ The worst case truncation error is calculated by considering that all the truncated bits
are ones.
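
The closed form in (8) can be cross-checked by adding up the weights of the discarded
bits directly. The Python below is a sketch under one stated assumption: with
t = (M + b) - (f + e) discarded columns and t no larger than M or b, the jth discarded
column holds t + 1 - j bits of weight $2^{-(e+j)}$. The function names are hypothetical.

```python
from fractions import Fraction


def truncation_error_closed_form(M, b, f, e):
    """Worst case error from equation (8)."""
    t = M + b - (f + e)
    return Fraction(1, 2 ** e) * (t - 1 + Fraction(1, 2 ** t))


def truncation_error_by_summation(M, b, f, e):
    """Sum the weights of all discarded bits, every one of them assumed to be 1."""
    t = M + b - (f + e)
    # Assumed column model: discarded column j (j = 1..t) contains t + 1 - j bits.
    return sum((t + 1 - j) * Fraction(1, 2 ** (e + j)) for j in range(1, t + 1))
```

For M = 16, b = 32 and f = 32, the two functions agree for every e from 1 to 15.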

When we round the individual terms and then discard bits that fall in column f + e
there are two worst case error situations. If the bits in column f + e are all ones, and all
bits to the right are zeros, the error is

$$E_{rta} = 2^{-(e+1)}(M + b - (f + e)) \quad (9)$$

²This can be shown by temporarily treating the terms as whole integers and assuming that all M terms
take on the maximum value ($2^b - 1$). The final sum will then be $(2^b - 1)(2^M - 1)$, which can be written
as $((2^{b+M} - 1) - (2^M - 1) - 2^b) + 1$. When written this way and assuming M < b it is easy to see that the
result occupies at most M + b bits.

³We also need f > b.
Figure 9: Static RAM Multi-Access Memory Plane


If the bits in column f + e are zeros and all the bits to the right are ones, the error is

$$E_{rtb} = 2^{-(e+1)}\left((M + b - (f + e + 1)) - 1 + 2^{-(M + b - (f + e + 1))}\right) \quad (10)$$

After the individual terms have been truncated or rounded to f + e bits, the final sum
is computed and then either truncated or rounded to f bits. This second truncation or
rounding will add to the total error. In the case of truncation, the worst case error will be

$$E_{tf} = 1 - 2^{-(e-1)} \quad (11)$$

In the case of rounding, the additional worst case errors are

$$E_{rfa} = 2^{-1}$$
$$E_{rfb} = 2^{-1} - 2^{-(e-1)} \quad (12)$$

Noting that $E_{rta} > E_{rtb}$ and $E_{rfa} > E_{rfb}$, we will use $E_{rta}$ and $E_{rfa}$ when referring to
rounding. There are four possible approaches to arriving at the final result depending on
which of truncation or rounding is applied to the individual terms and which is applied
to the final sum. The four possibilities are summarized in Figure 11. From the graph, we
see that as few as five or six extra bits beyond f are required in order to arrive at errors
that are very near what we would expect if we retained all the bits in the individual terms,
formed the sum and rounded or truncated the result to f bits. We also note that this is true
independent of whether the individual terms are first truncated or rounded. When e < 4,
the error resulting from rounding is similar to the error resulting from truncating one less
bit. These observations suggest that there is not a great deal to be gained by rounding
individual terms over truncating them. In the case of a fully parallel DA implementation,
the rounding of individual terms can be precomputed and only the rounded terms stored,
so there is no cost in doing so. However, in the multi-access memory based implementation,
rounding of individual terms would have to be performed upon access. As we shall see
shortly, the area requirement of the multi-access memory for rounding will be the same
as for a memory that truncates one less bit. This, coupled with the above observations on
error, indicates that rounding individual terms does not provide a very great advantage
over truncation in the multi-access memory.

Figure 10: The Final Sum


4 Implementing  Truncation  and  Rounding 

First we consider the transistor cost of a ROM based fully parallel DA implementation.
From Figure 10 we see that if we desire to compute the final sum to f + e bits, we will need
(M - t) ROMs storing b-bit words, and t ROMs that each store one bit less in succession,
where t = (M + b) - (f + e). Note that for consistency, t < b and t < M. If t > b, M
should be reduced and if t > M, b should be reduced. Now, referring to Figure 5 we see
that for each bit truncated, $2^K + 1$ transistors are saved. Since $\sum_{i=1}^{t} i = t(t + 1)/2$, the
cost of the implementation is

$$n_t = M(b2^K + 2^K + b) - (2^K + 1)\frac{t(t + 1)}{2} \quad (13)$$
Figure 11: Errors from Truncation and Rounding (M = 16, b = 32, f = 32; horizontal axis: e, number of extra bits)

Again, K-to-$2^K$ decoders are required for all M ROMs. We also note that this equation
applies equally well to both truncation and rounding of individual terms. Next we consider
the transistor cost of the multi-access ROM based implementation. We note that while
each stored term must have the full b bits, the width of the last t data paths decreases by
one bit for each path. This is shown in Figure 12 where the storage cells are implemented
by connections (or the lack thereof) to ground through a precharge transistor as in Figure
7. Thus, we save t(t + 1)/2 bus control transistors in each plane, so the overall cost of the
implementation is

$$n_t = Mb2^K + b2^K + 2^K - 2^K\frac{t(t + 1)}{2} \quad (14)$$

Comparing the savings of the two approaches, we see, not surprisingly, that the
multi-access ROM continues to lose ground against the fully parallel implementation. The
disadvantage is further amplified when we consider applying rounding to individual terms.
In the fully parallel approach, the rounding is precomputed, but in the multi-access
approach the rounding must be computed on-line. An extra bit in each of the terms to be
rounded is required. If the bit is a one, the one is to be added to the next more significant
bit. This could be achieved by routing the extra bit to an appropriate place in the adder
tree. Another approach might be to truncate so that a final sum of f + e + 1 bits is
computed, resulting in an equivalent error. In either case, an extra bit would be needed
for each of the M data paths in each of the $2^K$ planes.

Extending the comparison to the use of static RAM, from Figures 8 and 10, we see that
truncating or rounding so that the final sum is computed to f + e bits would save
$(5t(t + 1)/2)2^K$ transistors in the fully parallel implementation. Letting the storage
elements in Figure 12 be the 4-transistor cells used in Figure 9, we see that the savings from
truncation in the multi-access RAM based implementation is the same as it was in the
multi-access ROM, namely $2^K t(t + 1)/2$. Not surprisingly, we find that the fully parallel
RAM based implementation benefits more from truncation than does the multi-access RAM
based architecture. We note, however, that the multi-access architecture still possesses a
significant advantage. The ratio of the number of transistors becomes

$$R_{area} = \frac{(M + 4)b - t(t + 1)/2}{5(Mb - t(t + 1)/2)}$$

With t = M = 8 and b = 16, $R_{area}$ = 0.34 as compared to the 0.30 ratio that arises if t = 0.
We also note that the reduced number of bus control transistors and reduced bus widths
reduce the connection complexity of the multi-access architecture.

Figure 12: Multi-access Plane with Truncation
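
The effect of truncation on the transistor counts can be evaluated numerically. The
Python below is a minimal sketch of equations (13) and (14) and of the static RAM ratio
with truncation; the function names are hypothetical and, as in the text, decoders and
adder trees are left out.

```python
def saved_bits(t):
    """Number of bit positions removed when the last t paths shrink by 1, 2, ..., t bits."""
    return t * (t + 1) // 2


def rom_parallel_truncated(M, b, K, t):
    """Equation (13): fully parallel ROM cost after truncating t columns."""
    return M * (b * 2 ** K + 2 ** K + b) - (2 ** K + 1) * saved_bits(t)


def rom_multi_access_truncated(M, b, K, t):
    """Equation (14): multi-access ROM cost after truncating t columns."""
    return M * b * 2 ** K + b * 2 ** K + 2 ** K - 2 ** K * saved_bits(t)


def sram_ratio_truncated(M, b, t):
    """Multi-access to fully parallel static RAM transistor ratio with truncation."""
    return ((M + 4) * b - saved_bits(t)) / (5 * (M * b - saved_bits(t)))
```

With t = M = 8 and b = 16 the static RAM ratio evaluates to 156/460, which rounds to the
0.34 quoted above, and with t = 0 it returns to 0.30.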


5 Conclusions 

We have shown that the multi-access architecture requires significantly less area than a
fully parallel architecture when the number of transistors per stored bit is greater than
one, as it will be when static RAM cells are employed. Since this observation is based
on the assumption that the transistors used for storage are the same size as those used
for bus control, we can say more abstractly that the multi-access architecture will save
space whenever the area required to implement each storage cell is greater than the area
required to implement a bus control or routing transistor. The savings estimates do not
include the cost of decoders and the cost of the adder tree (which will be the same in both
cases). The area requirements of these elements must be added in before we can truly assess
the area savings advantage of our approach. Both approaches appear to be fairly regular,
and both lend themselves well to VLSI implementation. Again this observation is made
independently of the implementation of the decoders and the adder tree. The connection
complexity between these elements in both architectures also needs to be considered. In
short, a VLSI layout of both architectures needs to be done in order to compare the two
accurately.

We have also presented the errors associated with truncating or rounding individual
terms and the area savings that can result in both architectures from doing so. These
errors need to be reconsidered, placing them in the overall context of the inner product.
In particular we have not considered what b should be given M and K.
An Analog Retina Model for Detecting Dim Moving
Objects Against a Bright Moving Background

R.  M.  Searfus,  M.  E.  Colvin,  F.  H.  Eeckman, 

J.  L.  Teeters,  and  T.  S.  Axelrod 
Lawrence  Livermore  National  Laboratory 
Livermore,  California  94550 

Abstract  - We  are  interested  in  applications  that  require  the  ability  to  track 
a dim  target  against  a bright,  moving  background.  Since  the  target  signal 
will  be  less  than  or  comparable  to  the  variations  in  the  background  signal 
intensity,  sophisticated  techniques  must  be  employed  to  detect  the  target.  We 
present  an  analog  retina  model  that  adapts  to  the  motion  of  the  background 
in  order  to  enhance  targets  that  have  a velocity  difference  with  respect  to  the 
background. Computer simulation results and our preliminary concept of an
analog “Z” focal plane implementation are also presented.

1 Introduction 

We  are  interested  in  air  and  spaceborne  surveillance  applications  that  require  real-time 
target  detection  and  tracking  against  a moving  earth  background.  The  scene  observed 
from  the  surveillance  platform  may  range  from  a dark  earth  to  bright  sunlit  clouds  and 
terrain,  and  the  variation  in  the  intensity  of  a single  scene  may  span  several  orders  of 
magnitude.  As  long  as  the  target  intensity  is  sufficiently  larger  than  the  variations  in 
the  background  intensity,  simple  image  processing  techniques  such  as  spatial  filtering  and 
thresholding  can  produce  satisfactory  results;  however,  for  the  case  of  a dim  target  against 
a bright,  moving  background,  simple  processing  methods  may  produce  unacceptable  levels 
of  false  detections  or  may  completely  fail  to  detect  the  target. 

One  approach  for  reliably  detecting  and  tracking  targets  under  these  conditions  is  to 
subtract  the  moving  background  from  the  scene,  leaving  only  those  objects  that  have  a 
different  velocity  than  the  background.  Since  the  background  motion  may  not  be  known  a 
priori  and  may  change  throughout  the  course  of  observation,  it  is  important  for  the  sensor 
to  have  the  capability  of  adapting  to  changes  in  the  background  velocity.  The  number  of 
detector signals that must be simultaneously processed¹ imposes a computational demand
that  exceeds  the  capability  of  conventional  computer  hardware.  Furthermore,  for  a space 
environment,  low-power  consumption  and  compact  size  are  extremely  important  design 
constraints. 

In  this  paper,  we  present  a model  for  an  analog  retina  that  adapts  to  the  motion 
of  the  background  and  enhances  objects  having  a velocity  difference  with  respect  to  the 
background.  A computer  simulation  of  this  model  is  described,  and  our  experience  of 
using the simulation on real and synthetic data is discussed. We also describe a real-time
implementation of our model on a PIPE image processing computer, and present a
mapping of our model to a “Z” focal plane (Z-plane) technology [?] implementation that
addresses the real-time processing requirements and the design constraints for space-based
operations.

¹ A minimum detector array of 128x128 pixels is required; an array of 512x512 pixels is desired.

2 An  Analog  Retina-like  Model 

Very  sensitive,  high-resolution  electronic  imaging  systems  exist  with  capabilities  that  sur- 
pass those  of  any  biological  system.  However,  current  electronic  imaging  systems  do  not 
possess  the  robustness  of  a biological  system  when  confronted  with  a diverse  environment, 
and  also  lack  the  real-time  processing  power  of  even  the  simplest  vertebrate  retina.  For 
the relatively simple task of identifying and tracking moving objects, man-made devices
fall short of the biological systems they are designed to mimic.

The goal of our research effort has been to extract and understand the engineering
principles  underlying  natural  vision  systems  and  to  apply  that  knowledge  to  designing 
better  image  processing  hardware.  We  are  focusing  on  the  retina  because  research  has 
shown  that  some  animals  possess  enough  image  processing  “wetware”  to  detect  and  track 
moving  objects  using  only  a thin  layer  of  cells  at  the  back  of  the  eyecup  (the  retina). 

The  vertebrate  retina  is  more  than  just  a simple  light  sensor.  It  is  a complex  sensor- 
processor  device  that  transforms  the  incoming  light  signal  before  transmitting  it  to  the 
visual cortex and other subcortical regions. The retina's full range of functions is presently
unknown,  but  it  is  clearly  involved  in  dynamic  range  adjustments,  edge  enhancement,  color 
preprocessing,  and  change  detection.  The  retina  has  five  main  cell  types  (photoreceptors, 
horizontal  cells,  bipolar  cells,  amacrine  cells,  and  ganglion  cells)  and  two  synaptic  layers, 
the  inner  and  outer  plexiform  layers,  where  the  processes  of  these  retinal  cells  interact 
to produce nontrivial signal transformations. The outer plexiform layer handles spatial
processing  and  dynamic  range  adjustments,  while  the  inner  plexiform  layer  is  involved  in 
change  detection  and  temporal  processing.  A detailed  description  of  the  anatomy  and 
physiology of the vertebrate retina can be found in Dowling [?]. We must emphasize that
we  are  not  trying  to  duplicate  the  biological  retina.  Rather  we  have  borrowed  several 
design  principles  from  the  retina  (especially  the  outer  plexiform  layer)  to  solve  a specific 
image processing problem.

Our model consists of three major components as shown in the block diagram of Figure 1:
an artificial retina, augmented by a background removal network, and an image
enhancement network. Processing throughout the model is performed on analog data,
eliminating the need for analog-to-digital and digital-to-analog conversion.

The artificial retina is based on our previous work involving the use of a retina-like
model for detecting moving objects against a fixed background [?], and consists of two
parts: a photodetector array analogous to photoreceptors found in biological vision systems;
and an image conditioning network that mimics the function of horizontal and bipolar
cells  in  the  biological  retina.  The  photodetector  array  is  a mosaic  of  photosensitive  devices 
that convert light into an electrical signal. Unlike a CCD, which produces discrete frames of
time-averaged data, the photodetector array produces a continuous, time-varying image.
The output of the photodetector array is amplified by the image conditioning network,
which also provides temporal and spatial noise reduction.

Figure 1: This block diagram shows the three major components of our model: an artificial
retina; a background removal network; and an image enhancement network. The artificial
retina converts light into a continuous, time-varying image, and conditions this image
with amplification and spatial-temporal noise reduction. The background is removed by
network layers which subtract a shifted, time-delayed image from the conditioned image,
and the result is used to enhance the output of the artificial retina. Further analog and
digital processing can be performed on the enhanced image to meet application-specific
requirements.

The  background  is  subtracted  from  the  artificial  retina  output  by  the  background  re- 
moval network.  To  perform  this  operation,  output  from  the  artificial  retina  is  first  delayed 
by  a low-pass  temporal  filter  (depending  on  the  application,  this  delay  can  either  be  fixed 
or  variable).  The  delayed  image  is  then  spatially  shifted  by  a neural  network  layer.  The 
adjustable  weights  of  this  image-shifting  network  determine  the  total  spatial  displacement. 
A difference image is then formed by subtracting the delayed, shifted image from the artificial
retina output. If the weights of the image-shifting network are adjusted correctly, the
background  in  the  image  shifter  output  will  be  aligned  with  the  background  in  the  artificial 
retina  output,  and  the  backgrounds  will  cancel  in  the  difference.  Objects  having  a velocity 
difference  with  respect  to  the  background  will  leave  a negative  trace  at  the  trailing  end  of 
the  object  and  a positive  trace  at  the  leading  end. 
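
The delay, shift, and subtract sequence described above can be illustrated with a minimal one-dimensional Python sketch. It is not the authors' circuit or simulator: the function names, the sinusoidal test scene, the integer displacements (chosen so that cancellation is exact), and the linear-interpolation shifter are all assumptions made for illustration.

```python
import math

def sample(signal, x):
    """Linearly interpolate a periodic 1-D signal at a (possibly fractional) position x."""
    n = len(signal)
    i = math.floor(x)
    frac = x - i
    return (1.0 - frac) * signal[i % n] + frac * signal[(i + 1) % n]

def shift(signal, offset):
    """Hypothetical image shifter: output[k] = signal[k + offset]."""
    return [sample(signal, k + offset) for k in range(len(signal))]

def difference_image(current, delayed, offset):
    """Subtract the delayed, shifted image from the current retina output."""
    return [c - s for c, s in zip(current, shift(delayed, offset))]

# Assumed test scene: a sinusoidal background plus a dim one-pixel target.
N = 64
BACKGROUND = [100.0 + 50.0 * math.sin(2 * math.pi * k / 16) for k in range(N)]
V_BACKGROUND = -2   # background displacement per delay period (pixels)
V_OBJECT = 1        # target displacement per delay period (pixels)

def frame(t):
    """Scene after t delay periods."""
    image = [BACKGROUND[(k - V_BACKGROUND * t) % N] for k in range(N)]
    image[(30 + V_OBJECT * t) % N] += 5.0
    return image

delayed, current = frame(0), frame(1)
# The aligning offset is the complement of the background displacement.
diff = difference_image(current, delayed, offset=-V_BACKGROUND)
traces = {k: round(d, 6) for k, d in enumerate(diff) if abs(d) > 1e-9}
print(traces)  # {28: -5.0, 31: 5.0}
```

With the offset set to the complement of the background displacement, the background cancels and only the target remains: a negative trace at its trailing end and a positive trace at its leading end.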

An  error  feedback  network  modifies  the  weights  of  the  image-shifting  network  to  achieve 
and  maintain  background  alignment.  The  drift  error  is  determined  by  rectifying  the  dif- 
ference image  and  summing  the  rectified  values  (1). 

$$|E|^2 = \int_{\mathrm{Image}} \left( I(\vec{r}, t) - I(\vec{r} + \mathrm{offset},\, t + \mathrm{delay}) \right)^2 \qquad (1)$$

As  the  backgrounds  are  shifted  towards  alignment,  the  error  decreases;  as  the  back- 
grounds are  shifted  away  from  alignment,  the  error  increases.  In  a one-dimensional  case, 
the  derivative  of  the  error  with  respect  to  the  spatial  shift  will  determine  the  required 
shift  direction  necessary  to  bring  the  backgrounds  into  alignment  (see  Figure  2).  For  a 
two-dimensional  case,  the  gradient  of  the  error  with  respect  to  the  X/Y  spatial  shift  will 
determine  the  shift  direction.  The  magnitude  of  the  gradient  scaled  by  a feedback  gain 
can  be  used  to  determine  the  shift  distance.  To  allow  a more  detailed  analysis  of  the  error 
feedback,  we  have  been  studying  images  moving  in  a single  dimension  only. 
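
The one-dimensional feedback rule can be sketched as a simple gradient descent on the spatial offset. This is a hedged illustration, not the authors' error feedback network: the two-frequency `scene`, the use of a mean squared difference (in the spirit of equation (1)), the finite-difference derivative, and the gain and step count are all assumptions.

```python
import math

def scene(x):
    """Hypothetical 1-D background containing two spatial frequencies."""
    return math.sin(2 * math.pi * x / 10.0) + 0.5 * math.sin(2 * math.pi * x / 23.0)

N_PIXELS = 128
BACKGROUND_SHIFT = -0.25   # background displacement per delay period (pixels)

def drift_error(offset):
    """Mean squared difference between the current image and the shifted, delayed image."""
    total = 0.0
    for k in range(N_PIXELS):
        current = scene(k - BACKGROUND_SHIFT)   # current frame
        shifted = scene(k + offset)             # delayed frame after the image shifter
        total += (current - shifted) ** 2
    return total / N_PIXELS

def adapt_offset(gain=1.0, steps=50, h=1e-3):
    """Step the offset against the error derivative, scaled by a feedback gain."""
    offset = 0.0                                # start with no spatial shift
    for _ in range(steps):
        slope = (drift_error(offset + h) - drift_error(offset - h)) / (2 * h)
        offset -= gain * slope                  # sign of slope gives the shift direction
    return offset

print(adapt_offset())   # settles close to +0.25 for this assumed scene
```

The sign of the derivative selects the shift direction and its magnitude, scaled by the gain, sets the shift distance. A gain that is too large for the background's spatial frequency content would make this loop overshoot.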

It  is  possible  to  estimate  the  limits  on  the  accuracy  of  the  offset  determination  due 
to  the  use  of  this  simple,  aggregate  error  signal.  (More  complex  and  computationally 
expensive  shift  error  measures  are  possible,  such  as  calculating  a pixel-by-pixel  brightness 
correlation  function).  To  determine  the  effect  of  the  background  clutter,  we  can  compute 
the  change  in  the  sensitivity  of  the  error  signal  for  different  background  spatial  frequencies. 
If  we  assume  the  background  is  a one  dimensional  sinusoidal  grating,  then  the  magnitude 
of the error signal (as a fraction of its maximum possible value) is given by (2).

$$|E|^2 = 1 - \cos\left( \pi \cdot \frac{\text{offset error (pixels)}}{\text{background wavelength (pixels)}} \right) \qquad (2)$$


This  result  indicates  that  for  very  low  spatial  frequency  backgrounds  the  error  signal 
due  to  an  offset  of  a single  pixel  will  be  extremely  small,  (e.g.  0.03%  for  clutter  with  a 
wavelength  of  128  pixels)  and  will  limit  the  accuracy  of  the  offset  optimization.  However, 
for  the  applications  we  are  studying,  the  background  will  be  rich  in  spatial  frequencies,  and 
this  error  signal  is  quite  adequate  (e.g.  5.0%  for  clutter  with  a wavelength  of  10  pixels). 
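
The two percentages quoted above can be checked numerically. The short Python sketch below evaluates equation (2) for a one-pixel offset error, reading the argument of the cosine as pi times the ratio of offset error to background wavelength. The function name is hypothetical.

```python
import math

def relative_error_signal(offset_error_pixels, wavelength_pixels):
    """Error signal for a sinusoidal background, per equation (2)."""
    return 1.0 - math.cos(math.pi * offset_error_pixels / wavelength_pixels)

low_frequency = relative_error_signal(1, 128)   # about 0.03%
high_frequency = relative_error_signal(1, 10)   # about 4.9%, i.e. roughly 5%
print(f"{low_frequency:.3%}  {high_frequency:.3%}")
```

The ratio between the two cases is more than a hundredfold, which is why low spatial frequency backgrounds limit the accuracy of the offset optimization.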
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It  is  important  that  the  model  be  robust  to  environment  and  sensor  noise.  An  advantage 
of  the  aggregate  error  signal  used  in  our  algorithm  is  that  temporally  uncorrelated  noise 
will  be  averaged  out  in  the  sum  over  the  image.  To  a first  approximation,  the  variance 
of  the  error  signal  will  be  lower  than  the  variance  of  the  raw  image  signal  by  a factor 
proportional  to  the  number  of  pixels.  (The  true  statistics  are  somewhat  more  complicated 
since  the  error  calculated  at  each  pixel  is  rectified).  Hence,  random  noise  in  the  signal 
should  cause  little  degradation  in  the  optimization  of  the  offset. 

While  the  moving  background  is  eliminated  in  the  difference  image,  an  object  being 
tracked  can  take  on  a complex  spatial  structure  that  requires  further  processing  prior 
to  final  detection.  The  purpose  of  the  image  enhancement  network  is  to  perform  some 
of  this  processing  on  the  difference  image  and  use  the  processed  result  to  enhance  the 
retina  output.  The  processing  performed  by  the  image  enhancement  network  is  application 
dependent.  In  our  implementation,  the  image  enhancement  network  performs  a low-pass 
spatial  filter  on  the  rectified  difference  image  and  multiplies  this  result  with  the  output 
of  the  artificial  retina.  We  envision  that  further  processing  will  be  performed  to  meet 
application-specific  requirements.  For  example,  a readout  that  multiplexes  and  digitizes 
the  analog  image  followed  by  digital  processing,  such  as  the  Automatic  Centroid  Extractor 
(ACE)  chip  [?],  will  be  necessary  for  a complete  real-time  tracking  system. 
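
The enhancement step described above (low-pass filter the rectified difference image, then multiply by the retina output) can be sketched in one dimension as follows. The box filter and its radius are assumptions; the text specifies only a low-pass spatial filter, and the function names are hypothetical.

```python
def box_filter(values, radius):
    """Simple low-pass spatial filter: mean over a window, truncated at the edges."""
    n = len(values)
    smoothed = []
    for k in range(n):
        window = values[max(0, k - radius):min(n, k + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed

def enhance(retina_output, difference, radius=1):
    """Multiply the retina output by the low-passed, rectified difference image."""
    rectified = [abs(d) for d in difference]
    mask = box_filter(rectified, radius)
    return [r * m for r, m in zip(retina_output, mask)]

# Flat retina output; the difference image holds one negative and one positive trace.
print(enhance([10] * 8, [0, 0, 0, -3, 3, 0, 0, 0]))
# [0.0, 0.0, 10.0, 20.0, 20.0, 10.0, 0.0, 0.0]
```

Regions where the difference image is zero are suppressed, while the neighborhood of the moving object is passed through with a gain set by the filtered trace.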

Although  this  processing  model  is  very  versatile,  there  are  certain  limitations  imposed 
by  the  general  approach  and  the  model  in  its  current  form:  objects  to  be  tracked  should 
have  a different  velocity  than  the  background;  objects  should  typically  fill  only  a small 
portion  of  the  total  field  of  view  (FOV);  and  the  velocity  of  the  background  must  be 
relatively constant across the FOV. If an object has the same velocity as the background, it
cannot be distinguished from the background by its motion. If an object fills a significant
part of the FOV, its contribution to the total scene will bias the background motion
adaptation. In the worst case, the object fills so much of the FOV that it essentially
becomes the background. Finally, if the background velocity is not constant across the
entire FOV, the image shift will be misaligned and portions of the background will be visible
in  the  output.  We  are  currently  evaluating  techniques  to  overcome  these  limitations. 

3 Simulation  Results 

We  wrote  a simulator  which  allowed  us  to  evaluate  and  explore  variations  of  the  retina 
model.  To  preserve  the  analog  characteristics  of  the  model  and  provide  the  necessary 
flexibility  for  variations,  we  chose  to  implement  an  abstraction  of  an  electronic  prototyping 
breadboard.  Elements  of  the  model  are  represented  as  analog  circuit  modules  which  can 
be  “plugged  in”  and  “wired”  to  other  modules  on  the  breadboard.  A well-defined  interface 
simplifies  the  task  of  writing  new  circuit  modules,  and  existing  modules  can  be  grouped 
together  to  provide  arbitrarily  complex  modules. 

The  image  data  used  in  evaluating  our  circuit  designs  was  derived  from  a database  of 
real  and  synthetic  imagery.  The  synthetic  data  includes  various  simulated  cloud  scenes 
generated with a fractal program, earth scenes generated by a ray-tracing program, and a
frequency modulated (chirped) two-dimensional sinusoid. Our real data consists of a
sequence  of  images  looking  down  upon  the  Earth  from  the  Space  Shuttle  Challenger  during 
the  deployment  of  the  Long  Duration  Exposure  Facility  (LDEF)  satellite  (Shuttle  mission 
STS  41C).  A sequence  of  images  representing  the  output  of  the  photodetector  array  was 
produced  by  spatially  sampling  a single  frame  of  a database  image  (interpolation  was  used 
to  obtain  subpixel  velocities). 

One  application  of  our  software  simulator  in  evaluating  a given  design  is  to  determine 
the  open-loop  (no  feedback)  response  to  a moving  background.  Figure  2 illustrates  the 
open-loop  response  of  the  design  presented  in  this  paper  to  a cloudy  earth  background 
moving at a velocity of -0.25 pixels per delay (this image sequence was derived from the
Space  Shuttle  data).  The  error  feedback  to  the  image-shifting  network  was  disabled,  and 
the  weights  of  the  image  shifter  were  initialized  to  values  that  yielded  the  desired  spatial 
offset.  A probe  was  inserted  in  the  circuit  to  save  the  error  values  to  a file,  and  the  simulator 
was run. Next, the average error for the entire run was computed, and another run
was performed with a different spatial offset. Performing a set of such simulation runs
over a range of spatial offsets yields a curve such as that shown in Figure 2. The minimum
of the curve occurs at the spatial offset resulting in maximum background cancellation
(since the background in the delayed image of the example is displaced by -0.25 pixels,
a +0.25 pixel offset is required to bring the output of the image shifter into alignment with
the  output  of  the  artificial  retina).  For  spatially  simple  backgrounds,  the  open-loop  error 
curve  has  a straight  “V”  shape.  The  small  bends  and  kinks  in  the  curve  of  Figure  2 are  a 
result  of  the  complex  spatial  structure  of  the  clouds  in  the  moving  background. 
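
The open-loop procedure (feedback disabled, one run per fixed offset, error recorded for each) can be sketched as a sweep. This is an illustration under assumptions, not the authors' breadboard simulator: the synthetic `scene`, the sweep range and step, and the use of a summed rectified (absolute) difference as the error are all choices made here.

```python
import math

def scene(x):
    """Hypothetical 1-D background containing two spatial frequencies."""
    return math.sin(2 * math.pi * x / 10.0) + 0.5 * math.sin(2 * math.pi * x / 23.0)

N_PIXELS = 128
BACKGROUND_SHIFT = -0.25   # background displacement per delay period (pixels)

def open_loop_error(offset):
    """Sum of rectified differences between the current and the shifted, delayed image."""
    return sum(abs(scene(k - BACKGROUND_SHIFT) - scene(k + offset))
               for k in range(N_PIXELS))

# One "run" per fixed spatial offset, from -1.00 to +1.00 pixels in 0.05 pixel steps.
offsets = [i / 100 for i in range(-100, 101, 5)]
curve = [(offset, open_loop_error(offset)) for offset in offsets]
best_offset, best_error = min(curve, key=lambda point: point[1])
print(best_offset, best_error)   # 0.25 0.0
```

For this smooth synthetic background the curve falls to zero at +0.25 pixels, the complement of the -0.25 pixel background displacement. Real imagery would not cancel perfectly and would show the bends and kinks described above.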

We  have  also  used  the  simulator  to  study  the  behavior  of  the  closed-loop  circuit.  In 
this  mode,  the  simulator  is  simply  run  on  a data  sequence,  and  the  spatial  shift  of  the 
image-shifting network is controlled by the error feedback network. The initial values of
the image shifter’s weights are all zero (no spatial shift), but they quickly begin to change to
adapt to the background motion. Similar to the open-loop study, probes are inserted into
the  circuit  in  order  to  save  signals  and  images  to  files  for  post-simulation  analysis. 

An  example  of  the  circuit’s  closed-loop  performance  using  the  Space  Shuttle  data  with 
a superimposed moving object (small Gaussian blob) is shown in Figure 3. The background
in the circuit input (Figure 3a) moves -0.25 pixels during the delay period of the
background  removal  network.  The  object,  located  above  and  to  the  left  of  the  arrow  in 
Figure  3a,  moves  +0.25  pixels  during  the  same  time  interval,  and  has  a relative  intensity  of 
fifteen  percent  with  respect  to  the  peak-to-peak  background  intensity  variation.  Figure  3b 
demonstrates  the  circuit’s  ability  to  remove  the  moving  background  in  the  input.  Although 
the  object  is  not  easily  distinguished  in  the  input,  it  can  clearly  be  seen  in  the  output. 

Stability  and  noise  sensitivity  are  important  concerns  with  systems  involving  feedback. 
Our  current  circuit  design  contains  no  damping  elements.  Appropriate  choices  of  the 
feedback  gain  and  amplifier  cutoff/saturation  levels  avoid  wild  instabilities.  However,  after 
the  system  has  adapted  to  the  motion  of  the  background,  we  observed  that  it  tends  to 
fluctuate  slightly  around  the  optimal  spatial  offset  value. 

Figure 2: The system open-loop error in response to a background moving at a rate
of -0.25 pixels per delay period. As the spatial offset introduced by the image-shifting
network approaches the complement of the background displacement, the error between
the shifted, delayed image and the unshifted image approaches a minimum. Since the
background in this case has a velocity of -0.25 pixels/delay, the optimum spatial offset is
+0.25 pixels. The derivative of the error is used to correct the weights in the image-shifting
network to minimize the drift error.

Figure 3: (a) An input image taken from a sequence of data used in computer simulation,
and (b) the corresponding enhanced output image. The earth background in the input
image is moving at a rate of -0.25 pixels per delay period. A superimposed object moving
at a rate of +0.25 pixels per delay period is not easily distinguished in the input image, but
can clearly be seen in the upper-left corner of the enhanced output image. The relative
intensity of the object with respect to the peak-to-peak background intensity is fifteen
percent. Initially, the spatial offset of the image-shifting network is set to zero, and after
a few simulation steps, the weights of the image-shifting network adapt to the background
motion.

3rd NASA Symposium on VLSI Design 1991

In addition to the simulations described above, which were performed in batch mode,
we also implemented a real-time version of the algorithm on a PIPE image processing
computer [?]. The PIPE consists of eight processors operating in parallel, each of which can
perform complex operations on several frames of data in 1/60th of a second, and an ISMAP
board which sums all pixels in a single image. The input to the PIPE implementation
came directly from a tripod-mounted video camera. The output was displayed on a video
monitor. The scalar error signal, which was computed by the ISMAP board, was passed
from the PIPE to an IBM PC where the adaptation algorithm determined the new image-shifting
network weights, which were then passed back to the PIPE.

This  real-time  PIPE  implementation  allowed  us  to  test  the  performance  of  the  algo- 
rithm under  real-world  conditions  (environment  noise,  sensor  noise,  and  jitter  caused  by 
vibrations  in  the  camera),  and  also  determine  the  dynamic  offset  adjustment  for  a wide 
variety  of  backgrounds.  As  demonstrated  by  a video-tape  we  made  using  the  PIPE,  the 
algorithm  performs  well  under  these  conditions. 

4 Z-Plane  Implementation 

A typical  optical  sensor  system  includes  optical  elements,  a planar  array  of  detectors,  and 
a CCD  to  multiplex  the  detector  signals  into  a single  signal.  In  such  systems,  processing  on 
the  focal  plane  is  usually  limited  to  the  integration,  amplification,  and  serial  readout  of  the 
detector  signals.  Many  operations  that  are  currently  performed  off  the  focal  plane  on  the 
digitized  detector  signals,  such  as  spatial  and  temporal  filtering,  would  have  significantly 
higher  performance  if  they  could  be  performed  in  parallel  directly  on  the  continuous  analog 
detector  signals.  Recent  advances  in  fabrication  and  packaging  technology  provide  the 
ability  to  stack  analog  or  digital  processing  chips  together  and  bond  the  stack  onto  the 
back  of  a detector  array  (the  “Z”  dimension)  [?].  Using  this  technique,  hardware  that 
can  exploit  continuous  analog  image  signals  may  now  be  sandwiched  between  the  detector 
array  and  readout  electronics  to  form  a compact,  cube-like  image  processing  device. 

The  process  of  manufacturing  a Z-plane  module  consists  of  thinning  integrated  circuit 
(IC)  wafers  to  a desired  thickness  by  precisely  grinding  the  IC  substrate,  separating  the  IC 
wafer  into  individual  circuit  dies,  laminating  the  circuit  dies  into  a stack,  forming  external 
connections  to  the  laminated  circuits,  and  bonding  the  stack  to  the  detector  array.  Z-plane 
modules  with  detector  array  sizes  of  128x128  have  been  achieved,  and  arrays  of  up  to 
256x256  elements  are  in  the  range  of  current  Z-plane  technology.  Larger  focal  planes  have 
been  constructed  by  tiling  the  focal  plane  with  Z-plane  modules,  and  a specific  Z-plane 
implementation  has  been  shown  to  have  superior  signal-to-noise  characteristics,  provide 
more data processing, and consume less power than a comparable CCD implementation
[?].

The  real-time  processing  required  for  our  applications  currently  cannot  be  achieved  on 
a conventional  digital  computer;  however,  our  model  maps  nicely  to  the  parallel,  pipelined 
structure offered by Z-plane technology. Figure 4 illustrates a preliminary Z-plane packaging
concept for our processing model. A photodetector array is bonded onto the first layer,
which implements the image conditioning network. The background removal network is
partitioned into three layers. The first two layers implement the delay and image-shifting
network. The first of these layers performs a delay and weighted sum of the inputs in
the X direction, and the second layer completes the weighted sum in the Y direction (the
stacking of individual IC dies for these two layers is shown in Figure 4a). The third
layer of the background removal network implements the difference and error computations,
and the last layer performs image enhancement. The faces of the assembled cube
provide additional interlayer communication (such as the feedback signals to control the
image shifter), and a readout module would be bonded to the back of the cube to provide
an external interface. The multiplexed readout from such a device could be either analog
or digital, depending on the nature of the readout module. Note that all signals up to the
readout module are analog.

Figure 4: A preliminary Z-plane design of our model. (a) The exploded view shows the
distribution of processing modules along the Z axis. The first layer is a photodetector
array. An array of amplifiers and a spatial-temporal averaging network that mimics a
portion of the outer retina are implemented in the following layer. The background removal
network is partitioned among the next three layers (the X and Y processing stacks perform
the delay and spatial shift, and the next layer computes the difference and error). Image
enhancement is performed in the last processing layer. (b) A partially assembled cube
illustrates the use of cube faces for interlayer and external communication.

5 Summary and Future Work

We  have  presented  an  analog  retina  model  which  adapts  to  background  motion  in  order  to 
enhance  objects  with  velocities  different  from  that  of  the  background.  The  results  we  have 
obtained  from  computer  simulation  demonstrate  the  model’s  ability  to  perform  such  en- 
hancement for  one-dimensional  motion.  We  also  presented  a hypothetical  implementation 
of  our  model  using  “Z”  focal  plane  technology. 

Since  the  one-dimensional  simulation  results  are  very  promising,  we  are  now  proceeding 
to  study  the  remaining  research  questions  to  be  answered  before  creating  a more  detailed 
design to be implemented with Z-plane technology. We are currently investigating backgrounds
with two-dimensional motion. Although in principle this is a straightforward
extension to the current model, the error minimization is now in two dimensions and may
be  much  more  sensitive  to  system  control  parameters. 

Another  important  issue  is  how  the  system  will  perform  in  the  presence  of  external 
spatial-temporal  noise  and  internally  generated  noise  (such  as  noise  produced  from  analog 
component  drift  or  component  nonuniformity).  Some  of  the  noise  will  be  removed  by  the 
artificial  retina  and  by  the  implicit  spatial  averaging  in  the  error  signal,  but  we  must  verify 
that  the  remaining  noise  does  not  cause  the  optimization  of  the  spatial  offset  to  become 
unstable  or  experience  significant  drift. 

We  are  also  evaluating  alternate  image  enhancement  strategies  (many  of  these  would 
naturally  be  specific  to  a given  application)  and  techniques  to  handle  significant  background 
velocity differences over the FOV.

Abstract - This paper gives an overview of the research being conducted at
Stanford  University’s  Space,  Telecommunications,  and  Radioscience  Labora- 
tory in  the  area  of  low  energy  computation.  It  discusses  the  work  we  are  doing 
in  large  scale  digital  VLSI  neural  networks,  interleaved  processor  and  pipelined 
memory  architectures,  energy  estimation  and  optimization,  multichip  module 
packaging,  and  low  voltage  digital  logic. 

1 Introduction 

Our  research  in  low  energy  computation  for  signal  processing  is  being  supported  in  large 
part  by  NASA.  The  neural  network  research  is  being  funded  by  the  Center  for  Aeronautics 
and  Space  Information  Sciences  (CASIS).  Low  energy  computing  research  is  being  funded 
by  NASA  grant  NAGW1910,  “Low  power  signal  processing  technology  for  space  flight 
applications”. 

2 Overall  motivation 

Our  research  in  low  energy  computing  is  driven  by  the  need  to  maximize  computation 
rates  in  power  constrained  environments.  Space  based  data  systems  and  large  scale  neural 
networks  both  require  low  energy  per  operation;  in  flight  systems,  to  minimize  power 
consumption  during  data  gathering,  processing,  storage,  and  communication;  in  neural 
networks,  to  achieve  the  necessary  computation  rates  within  manageable  power  budgets. 
These  systems  are  characterized  by  high  sustained  levels  of  computational  effort,  unlike 
typical  portable  computer  applications,  which  tend  to  have  bursty,  and  much  more  modest, 
information  processing  requirements. 

3 4:2  adder  based  architectures 

We  have  been  building  deeply  pipelined,  parallel  signal  processors  since  1985  [17,18,3,2,19]. 
We  came  up  with  a multiplier  architecture  which  struck  a balance  between  throughput, 
latch overhead, and regularity [18]. The multiplier consists of a tree of “4:2 adders” (see Fig
2). A 4:2 adder (see Fig 1) has 4 inputs, a carry in, and generates two outputs and a carry
out. The carry out does not depend on the carry in. The 4:2 adder can be implemented
using two full adders, but a direct logic implementation can reduce the critical path from
four XORs in series to three. A multiplier built out of a tree of 4:2 adders has a much more
regular structure than a Wallace tree [1], which uses a full adder to reduce three partial
products to two at each stage. The 4:2 tree reduces 4 partial products to two at each
stage, and has the self-similarity of a binary tree. A 4:2 adder can efficiently accumulate
successive products in carry-save form. It can also be used in an ALU to perform arithmetic
operations in time independent of the number of bits in the operands.

Figure 2: 4:2 multiplier. N partial products are reduced to 2 in log2(N)/2 stages of 4:2
adders.
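
The two-full-adder construction mentioned above can be written out at the bit level. The Python sketch below is an illustration only (the function names are hypothetical, and it models logic values rather than the direct three-XOR circuit). It checks exhaustively that the outputs preserve the arithmetic sum and that the carry out is independent of the carry in.

```python
from itertools import product

def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def adder_4_2(x1, x2, x3, x4, carry_in):
    """4:2 adder built from two full adders: returns (sum, carry, carry_out)."""
    partial, carry_out = full_adder(x1, x2, x3)      # carry_out never sees carry_in
    total, carry = full_adder(partial, x4, carry_in)
    return total, carry, carry_out

for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, c, cout = adder_4_2(x1, x2, x3, x4, cin)
    # The sum has weight 1; both carries have weight 2.
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (c + cout)
    # Flipping the carry in leaves the carry out unchanged.
    assert cout == adder_4_2(x1, x2, x3, x4, 1 - cin)[2]
```

Because the carry out depends only on the first three inputs, a row of such cells has no carry ripple along the row. This is consistent with the statement that operation time is independent of operand width.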

We have recently shown that power is minimized in a parallel multiplier when the logic depth
ld = 11 [10]. A 4:2 adder has a logic depth of 10, including latches. By comparison, RISC
microprocessors  typically  have  logic  depths  around  40. 

We  currently  have  a number  of  projects  which  are  implementing  architectures  based  on 
the  latency  in  a 4:2  adder.  We  were  becoming  concerned  about  the  feasibility  of  running 
systems  at  the  clock  rates  implied  by  a logic  depth  of  10:  in  0.8  micron  CMOS,  a 4:2 
adder  based  clock  generator  circuit  runs  at  400MHz  [20].  However,  similar  speeds  have 
been  reported  elsewhere  [21].  Recently,  with  the  opportunity  presented  by  tiled  architec- 
tures and  3D  multichip  modules  as  discussed  in  [10,27],  it  appears  that  deeply  pipelined 
architectures  can  also  achieve  good  performance  at  very  low  energy. 

4 Neural  Nets 

Large scale neural nets will require on the order of 10^15 connections per second (CPS) [9].
Digital VLSI neurochips reported so far require around 1nJ per synaptic connection [22];
10^15 CPS would require a megawatt! Biological neurons require around 1fJ per synaptic
connection, 6 orders of magnitude less. Attaining biological energy efficiency in silicon is
a formidable challenge. We have identified a number of factors which together may reduce
connection energy by 5 orders of magnitude to 10fJ per connection, permitting 10^15 CPS
at around 10 watts. These include: reduced arithmetic precision (10x), reduced feature
size (10x), and low voltage operation (1000x).
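
The budget above can be checked with a few lines of arithmetic. The following is only a sketch: the constant and function names are ours, and it assumes the three reduction factors are independent and simply multiply.

```python
# Back-of-envelope check of the connection-energy budget quoted in the text.
import math

TARGET_CPS = 1e15        # connections per second for a large scale network
E_DIGITAL_J = 1e-9       # about 1 nJ per connection (reported digital neurochips)
E_BIOLOGICAL_J = 1e-15   # about 1 fJ per connection (biological neurons)

# Reduction factors listed in the text; treating them as independent
# multipliers is an assumption of this sketch.
REDUCTION_FACTORS = {
    "reduced arithmetic precision": 10,
    "reduced feature size": 10,
    "low voltage operation": 1000,
}


def power_watts(cps, joules_per_connection):
    """Average power needed to sustain `cps` connections per second."""
    return cps * joules_per_connection


def combined_reduction(factors):
    """Product of the individual reduction factors."""
    return math.prod(factors.values())


def target_energy_j():
    """Energy per connection after applying every reduction factor."""
    return E_DIGITAL_J / combined_reduction(REDUCTION_FACTORS)


if __name__ == "__main__":
    print("power today (W):", power_watts(TARGET_CPS, E_DIGITAL_J))
    print("target energy per connection (J):", target_energy_j())
    print("power at target (W):", power_watts(TARGET_CPS, target_energy_j()))
```

Under these assumptions the product of the factors is 10^5, which takes 1nJ down to 10fJ and a megawatt down to about 10 watts.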

In  addition  to  investigating  performance  of  large  networks,  we  are  implementing  a 
digital  Boltzmann  machine  [22]  to  demonstrate  the  viability  of  reduced  precision,  pipelined 
digital learning machines. The chip is being implemented in 2.0 micron CMOS, and consists of 32
5-bit neural processors, each supporting 1K 5-bit weights and capable of 80MHz operation.
The  chip  will  be  capable  of  2.5  billion  connections  per  second,  and  320  million  connection 
updates  per  second. 

5 Pipelined  Memory 

We  are  implementing  a pipelined  memory  architecture  (see  Fig  3)  which  achieves  high 
throughput  by  recursively  subdividing  the  memory  array  into  sections  which  can  be  tra- 
versed in  a single  cycle.  Addresses  are  partially  decoded  in  each  section.  The  remaining 
address bits are routed to the appropriate subsection where additional bits are decoded.
At the lowest level, the remaining bits are decoded and data is read out of or written into
a memory block. For read operations, the data is delivered back up through the subsec-
tions on subsequent cycles until it emerges at the pads. For write operations, the data
accompanies the address down the tree.

Figure 3: Pipelined memory
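
The address routing described above can be sketched as follows. This is an illustrative model only: the class name, the number of levels, and the bits decoded at the upper levels are hypothetical, and the latency formula is our assumption (one cycle per level going down, one cycle per upper level coming back up), not a figure taken from the design.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PipelinedMemoryModel:
    # Address bits decoded at each level, top of the tree first. The last
    # entry is the memory block itself (32 words, so 5 bits); the widths
    # of the upper levels are made up for illustration.
    bits_per_level: Tuple[int, ...] = (2, 2, 5)

    def route(self, address: int) -> List[int]:
        """Return the field decoded at each level, top level first."""
        fields = []
        remaining = sum(self.bits_per_level)
        for bits in self.bits_per_level:
            remaining -= bits  # bits still to be decoded further down
            fields.append((address >> remaining) & ((1 << bits) - 1))
        return fields

    def read_latency_cycles(self) -> int:
        """Assumed read latency: down through every level, then back up."""
        levels = len(self.bits_per_level)
        return levels + (levels - 1)

    def reads_completed(self, cycles: int) -> int:
        """Reads finished after `cycles` cycles when one read is issued per cycle."""
        return max(0, cycles - self.read_latency_cycles() + 1)


if __name__ == "__main__":
    mem = PipelinedMemoryModel()
    print(mem.route(0b10_01_00011))  # subsection 2, sub-subsection 1, word 3
    print(mem.read_latency_cycles(), mem.reads_completed(100))
```

The point of the sketch is that latency grows with the depth of the tree while throughput stays at one access per cycle once the pipeline is full.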

The  size  of  the  memory  block  is  matched  to  the  propagation  delay  through  a 4:2  adder. 
This turns out to be about 32 words x 32 bits. We have written an optimizer which sizes
the transistors in this block for the minimum area and power that matches the delay [25].
We  pipeline  the  address  decode  and  data  return,  placing  pipestages  to  minimize  power 
dissipation.  Power  dissipation  in  the  memory  is  greatly  reduced  by  selectively  clocking  the 
portion  of  the  memory  which  contains  the  data, leaving  the  rest  of  the  system  on  standby. 

Hierarchical memory organization first appeared in Mead and Conway [12,11], but this
architecture was not pipelined. An unpipelined binary tree memory was described at the
1987  International  Test  Conference  [8,26].  Hierarchical  address  decoding  was  reported  in 
a 4Mb  SRAM  with  selective  enable  to  reduce  power  dissipation  [7]. 

A pipelined memory architecture was discussed in [28]. The CT7C158 is a pipelined
64K SRAM offered by Cypress Semiconductor, who say: “Pipelined RAMs are used in
writeable control store, DSP and logic analyzer/tester applications where throughput is
the critical parameter.”

Our  pipelined  memory  is  the  first  to  combine  hierarchical  address  decoding  and  selective 
clocking to maintain very high throughputs and very low power dissipation.

Figure 4: Interleaved processor pipeline, with a normal RISC pipeline for comparison. In
this example, instruction fetches and memory accesses take four cycles. There are four
independent instruction streams in various phases of execution.

6 Interleaved  Processor 

We  are  working  on  a processor  architecture  which  achieves  high  performance  by  interleaving 
independent  instruction  streams  on  a deeply  pipelined  processor  (see  Fig  4).  The  number 
of  independent  streams  is  matched  to  the  latency  in  the  pipelined  memory.  The  clock 
frequency  is  a multiple  of  a RISC  clock,  and  is  obtained  by  placing  extra  pipestages  at 
critical  points  in  a RISC  architecture.  The  number  of  extra  pipestages  is  smaller  than 
expected  because  many  of  the  normal  RISC  stages  do  not  use  up  an  entire  clock  cycle. 
Our  objective  is  to  achieve  a 4x  speedup  over  RISC  in  a given  technology,  and  to  implement 
a subset  of  the  MIPS  R3000  instruction  set.  We  are  experimenting  with  a variety  of  power 
reduction  techniques  at  the  circuit  and  system  level  in  the  processor  design. 

Multiple  instruction  stream  processors  have  been  built  before  (Burton  Smith’s  work  on 
HEP,  Horizon,  and  Tera  [16,15]),  but  only  in  the  context  of  large  supercomputers  and  not 
single  integrated  circuits,  and  not  matched  to  the  latency  of  a pipelined  memory.  Edward 
Lee  at  UC  Berkeley  proposed  an  interleaved  architecture  for  use  in  signal  processing  [14], 
but  his  design  is  not  pipelined  as  deeply  as  ours,  and  does  not  include  pipelined  mem- 
ory. The  only  reference  we  have  found  so  far  which  describes  an  interleaved  processor 
and  a pipelined  memory  is  a Japanese  paper  on  gate-level  pipelined  Josephson  Junction 
circuits  [28],  which  also  describes  a method  to  increase  the  throughput  of  CMOS  memory 
by pipelining, but the two concepts are not synergized, and the memory organization is
not discussed. Stone and Cocke say “some combination of long pipelines and multiple
interleaved instruction streams may eventually prove effective for combining high speed
and high efficiency” but give no details [23].

The  RISC  community  is  also  investigating  techniques  for  increasing  performance.  The 
two  chief  techniques  are  superscalar  and  superpipelined  architectures  [5].  In  superscalar, 
more  than  one  instruction  may  be  in  progress  at  a given  time.  In  superpipelined,  the 
RISC  pipe  is  broken  into  a number  of  smaller  stages  with  reduced  logic  depth.  Both  of 
these  approaches  result  in  added  control  complexity  managing  the  potential  hazards  and 
resource  conflicts  which  may  result. 

Superscalar machines, such as the Intel i860, fetch more than one instruction on each
cycle,  and  execute  in  parallel  whenever  possible.  There  are  restrictions  in  the  combina- 
tion of  instructions  which  can  be  issued  simultaneously.  Superscalar  increases  resource 
utilization  but  does  not  increase  the  throughput  of  a given  functional  unit. 

We  reduce  RISC  logic  depth  by  a factor  of  4,  and  introduce  4 independent,  interleaved 
instruction streams. The streams are kept independent to avoid the hardware complexities
associated  with  managing  a highly  pipelined  single  thread  of  control.  Each  instruction 
stream  executes  its  next  instruction  every  fourth  cycle.  The  control  complexity  is  no  worse 
than  for  a RISC  machine  but  the  throughput  is  4 times  greater  on  problems  that  can  be 
parallelized.  Fortunately,  these  are  commonplace  in  signal  processing.  The  architecture 
also  supports  zero-overhead  context  switching  of  up  to  4 processes.  This  is  very  useful  in 
embedded  real  time  control  applications. 
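
The issue pattern can be illustrated with a small scheduling sketch. It assumes strict round-robin issue among the streams, which is the simplest reading of "each instruction stream executes its next instruction every fourth cycle"; the function names are ours.

```python
from typing import List, Tuple


def interleave_schedule(num_streams: int, num_cycles: int) -> List[Tuple[int, int, int]]:
    """Round-robin issue: one instruction per cycle, streams taking turns.

    Returns (cycle, stream, instruction_index) tuples, where instruction_index
    counts the instructions issued so far by that stream.
    """
    next_instruction = [0] * num_streams
    schedule = []
    for cycle in range(num_cycles):
        stream = cycle % num_streams
        schedule.append((cycle, stream, next_instruction[stream]))
        next_instruction[stream] += 1
    return schedule


def issue_cycles(schedule: List[Tuple[int, int, int]], stream: int) -> List[int]:
    """Cycles on which the given stream issued an instruction."""
    return [cycle for cycle, s, _ in schedule if s == stream]


if __name__ == "__main__":
    sched = interleave_schedule(num_streams=4, num_cycles=12)
    for s in range(4):
        print("stream", s, "issues on cycles", issue_cycles(sched, s))
```

With four streams, consecutive instructions of any one stream are four cycles apart, so a result that takes four cycles to produce is ready before the same stream needs it, while the machine as a whole still issues one instruction per cycle.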

6.1  Timing 

Real  time  signal  processing  tasks  often  require  “precise”  timing.  This  is  not  easy  in  cache- 
based  architectures,  since  cache  miss  recovery  times  can  often  be  data  dependent.  The 
pipelined  memory/interleaved  processor  behavior  is  precise:  instruction  latencies  are  fixed. 
Memory fetches always take 4 cycles. There are never any cache misses. Branch timing
is  precise. 

In  a conventional  RISC  machine,  the  latency  that  takes  place  during  a branch  is  un- 
predictable, because  it  depends  on  whether  the  target  address  is  in  the  instruction  cache, 
and  if  so,  how  it  is  aligned  within  the  cache  entry  that  contains  it.  Given  a 4 cycle  latency 
to  fill  a line  in  the  cache,  and  a cache  linewidth  of  4 words,  a branch  target  will  only  point 
to  the  first  word  in  the  line  25%  of  the  time.  The  system  must  stall  fetching  the  line 
following  the  line  containing  the  branch  target  address.  The  AMD29000  “branch  target 
cache”  solves  this  problem  by  aligning  cache  lines  to  branch  targets.  This  increases  the 
complexity  of  the  memory  subsystem.  The  interleaved  processor  solves  this  problem  by 
maintaining  a fixed  latency  on  every  instruction  fetch. 

6.2 Energy

Our  objective  has  been  to  maximize  overall  performance.  With  the  advent  of  Multichip 
module  technology,  the  performance  of  an  individual  chip  must  be  considered  in  light  of 
the  system.  We  now  are  designing  to  maximize  performance  at  minimum  energy.  The  best 
way  to  do  this  is  to  obtain  the  maximum  possible  throughput,  and  use  the  performance 
margin  to  lower  the  supply  voltage  until  all  the  available  area  is  used  and  the  power  budget 
is  met. 

The  clock  frequency  can  be  increased  by  a factor  of  4,  so  that  each  stream  can  execute 
as  fast  as  a RISC  processor  in  the  same  technology,  and  the  processor  can  achieve  4 times 
RISC performance. This implies 400MHz in 0.8 micron CMOS. Although this is feasible for small
numbers  of  processors,  we  plan  instead  to  lower  voltage  by  a factor  of  4,  to  1.25V.  This 
will  give  us  the  same  2D  performance  density  as  a RISC  machine,  but  will  require  only 
1/16  the  energy  per  operation  and  1/64  the  power.  We  can  capitalize  on  MCM  technology 
to  achieve  64  times  the  performance  with  64  times  the  area  for  the  same  power  budget. 
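
The 1/16 and 1/64 figures follow from a simple scaling argument, sketched below. The assumptions are ours and are first-order only: energy per operation scales as the square of the supply voltage, the achievable clock frequency scales linearly with the supply voltage, and the nominal supply is 5V (which is what the quoted 1.25V implies).

```python
from fractions import Fraction


def scale_supply(voltage_divisor: int) -> dict:
    """Relative energy, frequency and power after dividing the supply voltage.

    First-order model (an assumption of this sketch):
      energy per operation ~ V^2, clock frequency ~ V, power = energy * frequency.
    """
    v = Fraction(1, voltage_divisor)
    energy = v ** 2
    frequency = v
    power = energy * frequency
    return {
        "voltage": v,
        "energy_per_op": energy,
        "frequency": frequency,
        "power": power,
        # Number of such processors that fit in the original power budget.
        "processors_for_same_power": 1 / power,
    }


if __name__ == "__main__":
    result = scale_supply(4)
    print("supply (V):", float(5 * result["voltage"]))
    for key, value in result.items():
        print(key, value)
```

Dividing the supply by 4 brings the interleaved processor back to the throughput of a RISC machine, at 1/16 the energy per operation and 1/64 the power, so 64 such processors fit in the power budget of one.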

Also,  because  resources  are  pipelined,  more  time  is  available  to  wake  up  an  idle  resource 
or  put  it  on  standby.  Resources  only  need  to  be  clocked  if  they  are  being  used.  If  a resource 
is  used  by  one  stream,  but  not  by  the  next,  the  inputs  to  that  resource  can  retain  their 
previous  values. 

Register  files  normally  consume  a significant  portion  of  the  power  budget.  Since  each 
stream  has  its  own  register  file,  the  access  rate  to  a register  file  can  be  1/4  the  system 
clock  frequency.  Conventional  SRAM  is  faster  and  lower  power  than  multiported  register 
files since the bitlines never have to swing more than 100mV for reading or writing. If the
SRAM  can  be  accessed  in  a single  cycle,  it  can  emulate  a 4-port  memory  which  can  support 
any  combination  of  up  to  4 reads  or  writes  every  4 cycles.  In  its  standard  configuration  it 
would  be  accessed  sequentially  to  fetch  two  operands  and  write  back  a third.  Whether  this 
results  in  less  energy  depends  on  how  often  operand  addresses  are  repeated  on  successive 
instructions. 


6.3  Area 

The  interleaved  processor  should  require  area  comparable  to  a RISC  processor  because 
four  sets  of  registers,  program  counters,  and  other  state  registers  take  no  more  area  than 
on-chip  instruction  and  data  caches. 

7 Multichip  Modules 

Multichip  module  packaging  provides  a number  of  significant  new  opportunities  in  sys- 
tem architecture  and  implementation.  Bare  die  can  be  placed  much  closer  together  than 
packaged  parts,  leading  to  shorter  wires  and  reduced  communication  energy.  Area  bond- 
ing reduces  lead  inductance,  permitting  higher  frequency  interchip  communication.  Small 
bonding  pads  and  high  connective  capacity  support  seamless  interchip  communications 
optimized for propagating signals a few centimeters. Intrinsic bypass capacitance due to
thin  dielectric  separation  of  Vdd  and  Gnd  planes  results  in  higher  noise  immunity. 

The  net  result  is  the  opportunity  to  reduce  communication  energy  and  increase  system 
level  performance  by  orders  of  magnitude  compared  to  conventional  packaging  techniques. 
We  are  developing  interconnect  structures,  data  transmission  circuits,  and  clock  distribu- 
tion structures  for  high  performance  (hundreds  of  MHz),  low  power  (tens  of  mW)  MCM 
systems.  Much  of  our  work  in  this  area  has  been  reported  in  [24]. 

We  have  designed  a test  module  which  is  being  fabricated  by  ATT.  It  includes  passive 
structures  for  measuring  capacitance,  crosstalk,  and  characteristic  impedance  of  a variety 
of  conductor  geometries.  It  also  has  two  sites  for  MOSIS  Tiny  Chips  which  will  test  the 
interconnect  by  exchanging  pseudorandom  bitstreams  through  single  ended  and  differential 
transceivers at data rates in excess of 200 MHz.

7.1 Tiled architectures for signal processing

The  opportunity  exists  to  extend  the  concept  of  regularity  and  locality  so  widely  used  in 
VLSI  design  to  the  multichip  module  level,  and  to  identify  a set  of  processor  tiles  which 
can  tessellate  the  plane  to  generate  massively  parallel  architectures.  We  are  investigating  a 
variety  of  “tiled”  architecture  opportunities.  We  have  extended  our  neural  net  Boltzmann 
machine  architecture  to  accommodate  an  arbitrarily  large  two  dimensional  array  of  chips. 


8 Multiprocessing 

The  interleaved  processor  is  inherently  a symmetric  shared  memory  multiprocessor.  Mem- 
ory consistency  is  guaranteed  because  there  is  no  cache.  We  are  investigating  ways  to 
interconnect  interleaved  processors  for  massively  parallel  multiprocessing. 

8.1  Hierarchical  pipelined  ringbus 

One  possible  organization  of  a massively  parallel  system  is  a “hierarchical  ring  bus”  ar- 
chitecture which  supports  high  bandwidth  pipelined  data  exchange  among  multiple  pro- 
cessors. The  overall  topology  consists  of  rings  of  processors  connected  by  gateways.  Each 
local  ring  can  sustain  data  transfers  at  the  processor  clock  rate.  Because  the  bus  itself 
is  pipelined,  multiple  transactions  can  be  in  progress  concurrently,  up  to  the  number  of 
processors  in  the  ring.  One  of  the  nodes  in  the  ring  can  be  a gateway  to  another  ring  and 
can  sustain  the  same  I/O  bandwidth.  We  plan  to  match  the  bus  clock  frequency  to  the 
latency of a 4:2 adder.

This  architecture  has  been  proposed  elsewhere  [15].  We  think  it  is  well  matched  to 
the  performance  and  latency  of  the  interleaved  processors  and  multichip  module  based 
multiprocessors.  In  the  spirit  of  interleaved  instruction  streams,  the  latency  to  complete  a 
single  bus  transaction  will  be  at  least  equal  to  the  number  of  processors  in  the  ring,  but 
a separate  bus  transaction  can  be  in  progress  simultaneously  on  each  segment  of  the  ring. 
This  will  result  in  substantially  higher  throughput  than  conventional  bus  architectures  - in 
excess of 1 Gbyte/sec. This architecture is well suited to datastream oriented algorithms
common  in  real  time  signal  processing. 

Although  this  approach  introduces  single  point  failures  at  each  node  in  the  ring,  when 
placed  in  the  context  of  3D  multichip  module  implementation  we  think  the  approach  has 
some  significant  advantages. 

The  ringbus  concept  can  be  extended  gracefully  to  large  numbers  of  processors  by  recur- 
sively adding  subrings  connected  by  gateways.  We  will  be  analyzing  the  implementation 
complexity,  energy,  and  performance  of  this  approach  in  comparison  to  other  processor 
communication  networks. 

Of  key  interest  is  mapping  numerically  intensive  signal  processing  problems  onto  this 
architecture.  A 1024  processor  system  might  consist  of  64  rings  with  16  processors  in 
each ring. At 400 MIPS per node and 1 Gbyte/sec per ring, total performance would be
about 400 GIPS; total throughput would be 64 Gbytes/sec. Ring size can be optimized to balance
instruction  and  communication  bandwidth. 
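
The system totals quoted above can be reproduced with the short sketch below. The function and key names are ours; the minimum transaction latency simply restates the earlier remark that a bus transaction takes at least as many cycles as there are processors in the ring.

```python
def ring_system_totals(rings, processors_per_ring, mips_per_node, gbytes_per_sec_per_ring):
    """Aggregate figures for a system built from identical rings."""
    processors = rings * processors_per_ring
    return {
        "processors": processors,
        "gips": processors * mips_per_node / 1000.0,
        "gbytes_per_sec": rings * gbytes_per_sec_per_ring,
        # Lower bound on the cycles needed to complete one ring transaction.
        "min_transaction_latency_cycles": processors_per_ring,
    }


if __name__ == "__main__":
    totals = ring_system_totals(rings=64, processors_per_ring=16,
                                mips_per_node=400, gbytes_per_sec_per_ring=1)
    for key, value in totals.items():
        print(key, value)
```

For 64 rings of 16 processors this gives 1024 processors, 409.6 GIPS (rounded to 400 GIPS in the text), and 64 Gbytes/sec. Changing the ring size trades instruction bandwidth per ring against communication bandwidth per processor.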

9 Energy  estimation  and  optimization 

We  estimate  energy  using 

E_{ac} = \frac{1}{2} a C V^2

E_{dc} = I_{dc} V / f

where a is the activity ratio, the fraction of transistors switching on each cycle, C is the
capacitance being switched, V is the supply voltage, I_{dc} is the DC current, and f is the
clock frequency.
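
A minimal sketch of this estimate follows. The AC term is taken here as one half of the activity ratio times the capacitance times the supply voltage squared; the factor of one half is an assumption about the prefactor and is exposed as a parameter. The numbers in the usage example are hypothetical and serve only to exercise the functions.

```python
def ac_energy(activity, capacitance_f, supply_v, prefactor=0.5):
    """Switching energy per cycle: prefactor * a * C * V^2 (prefactor assumed)."""
    return prefactor * activity * capacitance_f * supply_v ** 2


def dc_energy(dc_current_a, supply_v, clock_hz):
    """Static energy per cycle: Idc * V / f."""
    return dc_current_a * supply_v / clock_hz


def energy_per_cycle(activity, capacitance_f, supply_v, dc_current_a, clock_hz):
    """Total estimate; short circuit current is neglected, as in the text."""
    return (ac_energy(activity, capacitance_f, supply_v)
            + dc_energy(dc_current_a, supply_v, clock_hz))


if __name__ == "__main__":
    # Hypothetical operating point: a = 0.25, C = 1 nF, V = 5 V,
    # Idc = 1 mA, f = 100 MHz.
    print(energy_per_cycle(0.25, 1e-9, 5.0, 1e-3, 100e6))
    # Quadratic dependence on supply voltage: 5 V to 1.25 V gives 1/16.
    print(ac_energy(0.25, 1e-9, 1.25) / ac_energy(0.25, 1e-9, 5.0))
```

Whatever the prefactor, the ratio of AC energies at two supply voltages depends only on the square of the voltage ratio.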

This  technique  relies  on  short  circuit  current  being  a small  fraction  of  the  total. 

We are investigating techniques for minimizing power dissipation by minimizing transistor
sizes while minimizing short circuit current. These are conflicting constraints, and
balancing them can lead to substantial power reductions over techniques which ignore short
circuit current and assume that minimum size devices result in minimum power.

We  have  modified  our  timing  simulator  to  measure  AC  power  dissipation  by  accumulat- 
ing dumped  charge.  Preliminary  results  suggest  good  agreement  with  power  measurements 
on  fabricated  chips.  We  are  extending  this  technique  to  measure  peak  power.  We  have 
developed  a memory  block  optimizer  which  sizes  transistors  in  the  pipelined  memory  to 
maximize  a “merit”  function  which  is  a weighted  combination  of  performance,  power,  and 
area.  We  are  including  the  effects  of  short  circuit  current  on  both  our  transistor  sizer  and 
our  memory  block  optimizer. 

We  have  found  that  transistor  sizing  is  important  in  optimizing  highly  pipelined  de- 
signs. Balancing  clock  delays  is  especially  important  to  minimize  clock  skew  in  the  system. 
Transistors  can  also  be  sized  to  minimize  energy,  which  involves  balancing  short  circuit 
current  against  gate  capacitance. 

10 Low Voltage Digital Logic

Massively  parallel  architectures  tiled  on  3D  stacked  multichip  modules  can  quickly  exceed 
the  ability  to  extract  heat  from  the  structure.  Reducing  the  supply  voltage  promises 
substantial  reductions  in  energy  and  power;  we  are  investigating  the  practical  limits  to  low 
voltage  operation.  This  area  is  covered  in  depth  in  [10]. 

Our  approach  to  low  energy  computation  has  attracted  interest  from  a number  of 
sources.  More  detailed  investigation  into  the  opportunity  is  being  funded  as  a “Research 
thrust”  by  Stanford’s  Center  for  Integrated  Systems.  These  research  thrusts  involve  inter- 
action with  technical  liaisons  from  CIS  industrial  partners.  So  far,  the  Ultra  Low  Power 
thrust  has  liaisons  at  DEC,  GE,  IBM,  Intel,  National  Semiconductor,  and  TI. 

11  Personnel 

Who  the  group  is: 

Professor  Allen  M.  Peterson,  Principal  Investigator 
P.  Roger  Williamson,  Senior  Research  Associate 
James  B.  Burr,  Senior  Research  Engineer 
Low Energy Computing

Bevan Baas        computer architecture
Jim Burnham       high speed interconnect
Ely Tsern         interleaved algorithms
Gerard Yeh        Low energy VLSI circuits
Sabeer Bhatia     Low energy process design

Neural Networks

Kan Boonyanit     Approximate Gradient Descent
Karen Huyser      Wafer Defect Classification
Michael Leung     Texture Recognition
Michael Murray    Precision, Learning, and VLSI

Collaboration

ATT, multichip modules
Sun, energy optimization
Intel, digital neural network architectures
Ricoh, neural net coprocessors

12 Conclusion

Our  research  in  low  energy  computation  has  been  motivated  by  recent  trends  in  VLSI 
technology,  multichip  module  packaging,  and  application  architectures.  We  believe  the  op- 
portunity exists  to  achieve  very  high  computation  rates  in  power  constrained  environments 
by  reducing  decision,  storage,  and  communication  energy. 

and Evaluation for SEP-Hardened Circuits

United Technologies Microelectronics Center
1575  Garden  of  the  Gods  Road 
Colorado  Springs,  CO  80907 

Abstract-  This  paper  describes  the  technology,  design,  simulation,  and  evalu- 
ation for  improvement  of  the  SEP  hardness  of  gate-array  and  SRAM  cells. 
Through  the  use  of  design  and  processing  techniques,  it  is  possible  to  achieve 
an SEP error rate less than 1.0E-10 errors/bit-day for a 90% worst-case geosynchronous
orbit environment.


1 Single  Event  Upset 

Single  Event  Phenomenon  (SEP)  occurs  when  a particle  or  heavy  ion  interacts  with  the 
silicon,  depositing  charge  on  critical  circuit  nodes,  causing  data  loss.  Devices  that  retain 
data  (RAMs  and  flip-flops,  for  example)  are  subject  to  SEP,  and  a particle  interaction 
in  one  part  of  the  chip  can  cause  data  loss  in  a circuit  located  far  away.  Non-storage 
nodes  can  propagate  the  pulse  to  other  circuit  nodes,  but  no  permanent  data  will  be  lost. 
Storage devices like RAMs, latches, and flip-flops may detect false clock pulses or reset
signals, caused by a pulse on a circuit node somewhere in the clock or reset generation and
buffering circuitry, and lose data.

The  storage  node  critical  charge,  Qc,  is  the  amount  of  charge  that  must  be  deposited 
on  the  storage  node  in  order  to  upset  the  stored  data.  Increasing  the  critical  charge  for 
the  sensitive  nodes  in  a cell  will  decrease  the  SEP  error  rate  by  lowering  the  probability  of 
encountering  an  ion  with  sufficient  LET  value  to  upset  the  cell.  This  critical  charge  can 
be  increased  by  changing  the  transistor  sizes  to  make  the  cell  more  stable,  adding  parallel 
paths  in  the  cell,  and  increasing  the  feedback  switching  time.  A more  stable  storage  cell 
will  require  a greater  voltage  change  on  the  storage  node,  or  a longer  voltage  pulse  to 
disturb  the  data. 

2 Charge  Deposition  Model 

In  the  past,  the  literature  has  implied  that  the  charge  generated  by  a single  event  may 
be  modeled  by  using  an  ideal  current  source  with  an  exponential  time  decay.  Using  this 
method,  the  current  source  is  connected  directly  to  the  sensitive  node,  and  the  charge 
applied  is  the  time  integral  of  the  current  pulse.  This  method  can  over-predict  the  value 
of  the  critical  charge  for  the  memory  cell  because  the  current  source  causes  the  junction 
to  which  the  current  source  is  applied  to  become  forward  biased  and  sink  a significant 
amount of charge to the power supply. Physically, the charge collection process is
self-limiting, and the junction will never conduct current in the forward direction. This is a
very  common  error  which  has  led  to  substantially  over-predicting  the  SEP  hardness  of  a 
particular  design.  This  can  lead  to  substantially  higher  SEP  error  rates  in  the  circuit  than 
can  be  tolerated  by  the  system  design.  In  this  paper,  we  describe  process,  simulation,  and 
design  techniques  that  may  be  used  to  improve  the  SEP  hardness  of  CMOS  circuits. 

3 SEP  Simulation 

We have used both HSPICE and DAVINCI (tm) [1] to model SEP phenomena. In the
SPICE  simulation,  the  cell  sub-circuit  is  accurately  modeled  using  device  parameters  which 
have  been  demonstrated  to  correlate  with  measured  silicon  values.  Figure  1 shows  the  sub- 
circuits that  model  the  charge  deposition.  They  use  idealized  MOS  devices  as  switches, 
and  standard  SPICE  components  for  everything  else.  This  modeling  technique  prevents 
forward-biasing  of  the  junction  and  provides  a more  accurate  simulation  of  the  charge 
required for circuit upset [2]. An internal node, Ndep, is initialized to a voltage, Vdep,
by switch N3 where Vdep x Cdep = Qdep (Qdep is the simulated deposited charge). A
switch, P2 or N2, is turned on in approximately 0.1 ns, shorting the node under test to a
power supply through a resistor that represents the resistance of the bulk silicon.

Figure 1: SPICE Charge Deposition Circuit (panels: LOW PULSE, HIGH PULSE)


As  charges  are  added  to  the  node  under  test,  the  same  amount  of  charge  is  removed  from 
the internal node Ndep. When the internal node Ndep is depleted of charge, a switch, P1
or N1, turns off so the circuit can recover its node voltages. A similar but slightly different
version of this charge deposition model is used for p+/n- and n-/p+ junction interactions.
This  technique  has  proven  extremely  valuable  to  optimize  the  design  and  layout  of  logic 
and  memory  circuits  for  SEP  hardness. 
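
The bookkeeping behind the two charge deposition models can be sketched numerically. This is not a circuit simulation: it only computes the charge implied by an exponential current source and the initial voltage on Ndep for a given deposited charge. All function names and component values are hypothetical.

```python
def exponential_source_charge(i_peak_a, tau_s):
    """Charge delivered by I(t) = i_peak * exp(-t / tau), integrated over t >= 0."""
    return i_peak_a * tau_s


def vdep_for_charge(q_dep_c, c_dep_f):
    """Initial voltage on Ndep so that Vdep x Cdep = Qdep."""
    return q_dep_c / c_dep_f


def ndep_voltage_after_transfer(v_dep, c_dep_f, q_transferred_c):
    """Voltage left on Ndep after charge has moved to the node under test.

    The result is clamped at zero: once Ndep is depleted, no further charge
    can be delivered, which is what makes the model self-limiting.
    """
    return max(0.0, v_dep - q_transferred_c / c_dep_f)


if __name__ == "__main__":
    q_dep = 1e-12    # 1 pC deposited charge (illustrative value)
    c_dep = 0.2e-12  # 0.2 pF capacitor on Ndep (illustrative value)
    v_dep = vdep_for_charge(q_dep, c_dep)
    print("Vdep (V):", v_dep)
    print("Ndep after half the charge (V):",
          ndep_voltage_after_transfer(v_dep, c_dep, q_dep / 2))
    # An ideal 1 mA, 1 ns exponential source injects the same 1 pC
    # regardless of what the junction voltage does.
    print("ideal source charge (C):", exponential_source_charge(1e-3, 1e-9))
```

The contrast is that the ideal current source delivers its full charge no matter how far the node is driven, whereas the capacitor-based model can deliver at most Qdep and stops when Ndep is empty.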

Numerical simulation of the charge deposition from a heavy ion hit has been attempted
by others [3,4]. We have performed three-dimensional numerical simulation of single event
upset  using  the  three-dimensional  simulator,  DAVINCI  tm.  The  simulations  were  per- 
formed on  an  n-channel  junction  of  a twin-tub  CMOS  device  with  approximately  4 um  of 
an  epitaxial  layer.  In  figure  2,  the  potential  contours  of  the  device  junction  are  shown  48 
ps  after  a single  event  hit  with  a gold  ion.  Note  that  the  n+  junction  is  no  longer  at  five 
volts and that the distribution of the funnel favors the epi/p-well junction. As the voltage
at the n+ junction is reduced, there is less potential difference between the junction and the
funnel than between the funnel and the epi, which is held at five volts.

Figure 2: Junction Potential After Ion Strike (plot title: seu7gold30_let, Potential, 48.0 ps)

Figure 3: n+ junction vs. p-well and substrate (plot title: seu7, Node Voltages)

Figure  3 shows  the  potential  of  the  n+  junction  relative  to  the  p-well  contact  and  epi 
substrate  contact.  Note  that  the  time  constant  is  ~5-10  ps.  Whether  a flip-flop  will  change 
state  as  a result  of  this  SEP  event  will  depend  upon  how  long  the  n+  junction  potential  is 
below  the  switch  point  of  the  cell.  The  n+  junction  can  go  to  zero  volts  and  not  cause  an 
upset  if  it  recovers  before  the  zero  can  propagate  back  through  the  cross-coupled  logic.  We 
have  found  DAVINCI  tm  useful  for  looking  at  wafer  fabrication  process  methods  for  SEP 
hardening  of  CMOS  devices,  as  it  enables  the  evaluation  of  effects  of  doping  concentration, 
epi  thickness,  etc. 




4 Heavy  Ion  Testing  And  SEP  Numerical  Calculation 

Heavy  ion  testing  for  this  work  was  performed  at  Brookhaven  National  Laboratories,  using 
their  Tandem  Van  de  Graaff  system.  Three  ion  conditions  were  chosen  for  this  testing. 
They  included  Gold  at  350  MeV,  Iodine  at  320  MeV  and  Bromine  at  285  MeV.  The  LET 
values  were  further  varied  by  adjusting  the  angle  of  incidence  from  0 to  60  degrees. 
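
Tilting the device lengthens the ion path through the sensitive region, which raises the effective LET. A common first-order model, assumed here rather than taken from the text, divides the normal-incidence LET by the cosine of the angle of incidence. The function name and the LET value in the example are illustrative only.

```python
import math


def effective_let(let_normal, angle_deg):
    """Effective LET under the cosine-law assumption: LET / cos(angle).

    `let_normal` is the LET at normal incidence (for example in MeV cm^2/mg);
    `angle_deg` is the angle of incidence measured from the surface normal.
    """
    if not 0.0 <= angle_deg < 90.0:
        raise ValueError("angle of incidence must be in [0, 90) degrees")
    return let_normal / math.cos(math.radians(angle_deg))


if __name__ == "__main__":
    # Illustrative normal-incidence LET of 50, swept over the tested range.
    for angle in (0, 30, 45, 60):
        print(angle, "degrees:", round(effective_let(50.0, angle), 2))
```

Under this model the 0 to 60 degree range used in the testing spans a factor of two in effective LET for each ion.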

To determine the SEP error rate from measured data, the effective cross-section is
selected at an LET of 100 MeV x cm^2/mg (surface value). The upset rate calculation
can  be  performed  using  either  CREME  [5]  or  SpaceRad  [6]  programs.  The  simulation 
conditions  for  determining  the  error  rates  quoted  in  this  paper  are  as  follows: 

1.  Geosynchronous  circular  orbit,  35900  km; 

2.  Orbital  inclination,  0 degrees; 

3.  Adams  90%  worst-case  environment,  including  the  earth’s  shadow  and  geomagnetic 
storms; 

4.  All ions from Hydrogen through Uranium, 1 <= Z <= 92.

5 Process  Techniques  For  SEP  Hardening  Of  CMOS 


Wafer fabrication processing can have a strong effect on the SEP sensitivity of CMOS
circuits.  As  shown  in  the  DAVINCI  tm  simulations  above,  over  half  of  the  charge  deposited 
by  a heavy  ion  can  be  collected  at  the  epitaxial  junction,  away  from  sensitive  circuit  nodes, 
if  the  epitaxial  layer  is  sufficiently  thin.  Also,  because  the  drive  current  of  p-channel  devices 
is  typically  less  than  that  of  n-channel  devices,  the  most  SEP-sensitive  nodes  tend  to  be 
n-channel  nodes  supported  by  p-channel  transistors.  Therefore,  to  minimize  the  charge 
collection  on  these  nodes,  a p-well  type  process  is  desirable  since  the  p-well-to-substrate 
junction  will  help  to  collect  a substantial  portion  of  the  deposited  charge. 

Also, because n-type dopants (n-type substrates are used for a p-well process) diffuse
much more slowly than p-type dopants, it is possible to fabricate much thinner epitaxial
layers for a p-well process, further reducing the SEP sensitivity of the technology. The
lower sheet resistances and (typically) higher doping of a p-well process also help to
eliminate SEP-induced latch-up. High doping concentrations also help to increase the junction
capacitance, further reducing SEP susceptibility. Thin gate oxide also increases node
capacitance, thereby increasing the critical charge on a node and improving SEP hardness.
Poly-resistors or natural p-channel transistors can also be added to the process to allow
the design of high-density SEP-hardened memory cells.

SOS and SOI (Silicon-On-Sapphire and Silicon-On-Insulator) processes reduce the
amount  of  charge  collected  on  the  junction  and  the  effective  critical  charge  on  each  node. 
Thin-film  SOI  devices  are  also  sensitive  to  bipolar  snap-back.  This  has  the  effect  of  making 
the channel region of the n-channel transistors sensitive to SEP upset. Therefore it is not
sufficient  to  process  a design  on  SOS/SOI  substrates  to  obtain  good  SEP  performance. 
Design  and  special  processing  techniques  must  be  used  also  to  assure  SEP  hardness  of  the 
circuits. 

Commercial  CMOS  wafer  fabrication  processes  usually  do  not  consider  SEP  upset  and 
latchup  in  their  design.  They  are  optimized  for  speed  and  density,  both  of  which  can 
compromise  good  SEP  performance.  UTMC  has  designed  its  twin-tub  epitaxial  p-well 
CMOS  process  (including  poly-resistors)  and  layout  rules  to  provide  an  optimum  balance 
between  good  SEP  performance,  latchup  immunity,  speed,  and  density. 

6 SEP  Hardening  Techniques  For  Memory  Circuits 

In  the  design  of  memory  systems,  several  techniques  may  be  used  to  provide  SEP-insensitive 
memory  systems.  These  include  the  use  of  redundant  memory  with  voting  logic  and/or 
error  detection  and  correction.  Both  of  these  techniques  require  additional  system  overhead 
and  result  in  a degradation  of  system  performance,  as  well  as  increased  cost  and  weight. 
Some  systems  cannot  afford  this  additional  overhead,  and  therefore  require  the  use  of 
SEP-hard  SRAMs. 

Several  SEP  hardening  techniques  for  SRAM  memory  cells  have  been  reported  in  the 
literature.  These  include  the  use  of  cross-coupled  resistors,  cross-coupled  capacitors,  and 
cross-coupled  p-channel  transistors.  [7,8]  A schematic  diagram  of  a memory  cell  with  cross- 
coupled  resistors  (as  used  in  UTMC’s  rad-hard  64K  SRAM)  is  shown  in  figure  4.  All  of 
these  techniques  serve  to  increase  the  write  time  constant  of  the  cell,  thereby  increasing  the 
effective  critical  charge  on  the  internal  nodes  of  the  cell.  All  of  the  techniques  increase  the 
wafer  fabrication  processing  complexity  and  the  area  of  the  SRAM  cell.  The  advantages 
and  disadvantages  of  each  are  shown  in  Table  1. 


Figure 4: Memory Cell Schematic

Today, cross-coupled resistors are used in many SEP-hard designs primarily because
the processing required to add and control the resistors is a relatively straightforward
extension of standard CMOS SRAM processing techniques. However, if proper care is not
taken to optimize the cell layout for performance, SEP sensitivity, and resistor tolerance,
SEP performance may be greatly degraded at high temperature (above 85 C), and write
cycle time may be degraded at low temperature (-55 C).

SRAM Cell Hardening        Advantages                   Disadvantages
Technique
-------------------------  ---------------------------  --------------------------------------
Cross-coupled resistors    Known processing.            TCR of resistor causes SEP sensitivity
                           First published technique.   to change over temperature.
                                                        Cell area.
                                                        Effects of resistor geometry.

Cross-coupled capacitors   Minimum variation of SEU     Cap oxide defect sensitivity.
                           with temperature.            Cell area.

Cross-coupled transistors  Minimum cell area.*          Process control on transistor Vt.

* Technology and layout dependent

Table 1: Advantages and Disadvantages of SRAM SEP-Hardening Techniques


6.1  SEP-Hardened  Memory  Design 

We  have  taken  great  care  to  optimize  the  performance  of  the  rad-hard  64K  SRAM  over 
the entire military temperature range (-55 C to 125 C). The sizes of the transistors in
the cell were optimized in concert with the cross-coupled resistor values using the SPICE
simulation techniques described earlier to assure that the product would be manufacturable
and meet data sheet specifications, including a 1E-10 errors/bit/day requirement, over the
entire  military  temperature  range.  UTMC’s  64K  SRAM  memory  cell  is  more  stable  than 
most  SRAM  memory  cells  because  the  p-channel  transistor  in  the  64K  SRAM  memory 
cell  is  larger  than  the  n-channel  transistor  and  will  supply  almost  as  much  current.  The 
increased  p-channel  size  increases  the  switch  point  of  the  cross-coupled  inverters  from 
approximately  Vtn  (0.8  V)  to  approximately  Vdd/2  (2.5  V).  The  increased  switch  point 
requires  that  a particle-induced  voltage  pulse  on  one  of  the  storage  nodes  exceeds  Vdd/2 
instead  of  Vtn  before  the  other  storage  node  can  be  affected.  A more  stable  memory  cell 
is  harder  to  write  and  requires  additional  area.  The  increased  p-channel  transistor  size 
increases  the  die  size  by  approximately  7%. 

In  UTMC’s  64K  SRAM,  the  write  time  specification  can  be  met  even  with  the  more 
stable  memory  cell  because  the  write  circuitry  forces  the  memory  cell  columns  to  Vdd  and 
Gnd  during  a write  operation.  Although  the  write  circuitry  that  provides  Vdd  and  Gnd 
is larger than a standard write circuit that only provides Gnd, it increases the die size by
less than 0.5%.

The 64K SRAM uses high-valued polysilicon resistors in series with the gates of both
cross-coupled inverters and increased capacitance on the storage nodes to increase the
feedback switching time. Increased stability and increased feedback switching time provide
improved SEP protection. The 64K SRAM can use a lower valued resistor because
the memory cell is more stable than a typical SRAM. The lower valued resistor is more
manufacturable and is less affected by temperature. The polysilicon resistor process provides
tighter-than-typical control over the resistor value. This tight resistance control
provides greater SEP protection over a wider temperature range and also allows shorter
write times. Including a resistor in the memory cell increases the die size by less than 9%.
The  cross-coupled  resistors  are  incorporated  in  the  single  layer  of  polysilicon  which  also 
forms  the  gates  of  the  transistors.  Metal  contacts  for  power  and  ground  are  incorporated 
in  every  cell  to  help  collect  some  of  the  charge  deposited  by  a particle  passing  through  a 
nearby  junction. 

The  effect  of  the  cross-coupled  resistors  is  to  increase  the  threshold  LET  for  SEP  upset 
and reduce the effective saturated cross-section of the device [9]. These effects for UTMC’s
64K SRAM are shown in figures 5 and 6. Using the LET threshold and effective cross-section
from  these  graphs,  the  error  rate  in  errors/bit/day  as  a function  of  resistor  value  may  be 
calculated  for  any  space  environment  using  CREME  or  SpaceRad  as  described  above.  The 
error  rate  at  125  C (worst  case)  for  UTMC’s  64K  SRAM  is  shown  in  figure  7 as  a function 
of  resistor  value.  By  screening  devices  at  the  wafer  level  for  resistor  value  and  p-channel 
drive  current,  we  can  guarantee  an  error  rate  of  less  than  1.0E-10  errors  per  bit  per  day  at 
125  C. 
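
To put the guaranteed per-bit rate in device-level terms, the following minimal sketch converts errors per bit per day into an expected interval between upsets for one part. It assumes that "64K" means 65,536 bits and that upsets in different bits are independent, so the per-bit rates simply add; the function name is illustrative and not taken from the paper.

```python
BITS_64K = 64 * 1024  # assumption: "64K" means 65,536 bits


def device_upset_stats(rate_per_bit_day, n_bits=BITS_64K):
    """Return (errors per device per day, mean days between upsets)."""
    device_rate = rate_per_bit_day * n_bits
    return device_rate, 1.0 / device_rate


rate, days = device_upset_stats(1.0e-10)
print(f"{rate:.3e} errors/device/day, about {days / 365.25:.0f} years between upsets")
```

Under these assumptions a single device at the guaranteed limit would be expected to see one upset every few hundred years, which is why the per-bit figure matters mainly for systems that carry many such devices.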
Figure 5: LET Threshold vs. Resistor Value

Figure 6: Error Rate vs. Resistor Value


6.2  SEP-Hardened  Flip-Flop  Design 

In  logic  systems,  storage  nodes  such  as  flip-flops  must  retain  data  reliably,  or  the  integrity 
of the logic system can be severely compromised. A SEP-induced upset of a single bit in a
microprocessor  register  can  send  the  system  into  an  irrecoverable  state.  Detection  of  these 
types  of  errors  can  require  substantial  overhead  in  software  and  hardware  complexity  and 
Figure 7: Error Rate vs. Resistor Value


is  possible  to  substantially  improve  the  SEP  performance  without  introducing  additional 
processing complexity. However, there is usually a penalty in increased area.

If the wafer fabrication process technology has poly-resistors available (used for
SEP-hardening the SRAM cells described above), these resistors may be used to increase flip-flop
hardness as shown in figure 8. This technique can result in some performance degradation
of the flip-flop over temperature as reported by Sexton et al. [10] If resistors are not
available, circuit techniques, coupled with simulation techniques similar to those described
above, can be used to develop flip-flop register cells which have improved SEP performance
over conventional flip-flop circuits. [11,12]

To determine the effectiveness of the simulation techniques described above, we simulated
the SEP upset for a number of the flip-flop cells in UTMC’s gate-array library.
The simulations for one of these cells, DFAPCB (a D-type flip-flop whose logic diagram is
shown in figure 9), were compared with experimental data from heavy ion tests performed at
Brookhaven National Laboratory. To accurately determine the upset rate of the flip-flop,
it was necessary to determine the effective critical charge (the charge required to upset the
state of the flip-flop) for every node within the circuit for both a high-node state and a
low-node state. The effective critical charge, along with the junction area for each node,
was then used to determine the upset rate for the node using the SpaceRad program. The
sum of the upset rates for all nodes was then taken as the upset rate for the flip-flop. The
simulated error rate for the DFAPCB flip-flop was 4.24E-8 errors per cell per day. This
compares to an experimentally determined error rate of 3.38E-8 errors per cell per day for
this flip-flop cell.

Figure 9: D-type Flip-flop Logic Diagram
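
The bookkeeping in this procedure (sum the per-node rates, then compare with measurement) can be sketched as follows. The per-node values and node names are hypothetical placeholders chosen only so that they add up to the reported simulated total; the paper does not list individual node rates, and the helper names are illustrative.

```python
def flipflop_upset_rate(node_rates):
    """Sum per-node upset rates (errors per cell per day) over all nodes and states."""
    return sum(node_rates.values())


def relative_error(simulated, measured):
    """Signed difference of the simulated rate relative to the measured rate."""
    return (simulated - measured) / measured


# Hypothetical per-node rates keyed by (node, stored state); not data from the paper.
example_nodes = {
    ("master", "high"): 1.50e-8,
    ("master", "low"): 0.90e-8,
    ("slave", "high"): 1.14e-8,
    ("slave", "low"): 0.70e-8,
}

total = flipflop_upset_rate(example_nodes)
print(f"total = {total:.2e} errors/cell/day")
print(f"simulated vs. measured: {relative_error(4.24e-8, 3.38e-8):+.1%}")
```

With the two reported totals, the simulated rate comes out roughly 25% above the measured one.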

We  have  also  developed  SEP-upset  improved  cells  which  support  our  radiation-hard 
gate array cell library. Some of the cells are capable of providing an error rate of 1.0E-10
errors per cell per day. Fully redundant cells have also been designed which require
twice  as  many  transistors  as  a non-redundant  design,  but  are  SEP-immune.  The  results  of 
this  work  have  provided  a number  of  guidelines  for  selecting  and  designing  an  SEP-hard 
flip-flop.  These  guidelines  are  discussed  in  Table  2. 


7 Conclusions 

Use of the simulation techniques described in this paper substantially increases the
confidence that a design will meet its objectives for SEP hardness, and the cell layout can
be optimized without compromising circuit performance. We have demonstrated an SEP
error rate less than 1.0E-10 errors/bit-day for a 90% worst-case geosynchronous orbit
environment over the entire -55 C to +125 C temperature range for a rad-hard 64K SRAM
while maintaining a less than 55 ns cycle time. We have also demonstrated the capability
to model the SEP error rate of gate array cells and have applied the simulation and design
techniques described in this paper to develop SEP-tolerant and SEP-hard flip-flop designs.

Problem                        Flip-Flop Design Approach
-----------------------------  ----------------------------------------------------
Minimize stacked p-channel     Remove NOR gates and replace with NAND gates.
devices

Eliminate transmission gates   Remove transmission gates and replace with clocked
                               inverters.

Minimize sensitive node area   Simplify design as much as possible consistent with
                               functional requirements. Add redundant (parallel)
                               transistors on sensitive nodes internal to the cell.

Table 2: Design Considerations for Improving SEP Hardness of Flip-Flops

SEP  Hardness  of  integrated  circuits  cannot  be  assured  by  screening  commercial  devices, 
or  by  normal  system-level  or  logic-level  design  techniques.  Good  SEP  hardness  can  only 
be  obtained  by  using  a wafer  fabrication  process  which  provides  the  proper  characteristics 
and  proper  attention  to  good  design  practices  at  the  transistor  level.  Since  commercial 
semiconductor  manufacturers  do  not  consider  SEP  effects  when  designing  their  circuits, 
it  will  be  necessary  to  develop  custom  circuits  which  are  designed  for  SEP  hardness  for 
mission-critical applications.
Pulse-Firing Winner-Take-All Networks

Jack  L.  Meador 

School  of  Electrical  Engineering  and  Computer  Science 
Washington  State  University 
Pullman  WA,  99164-2752 

Abstract-  Winner-take-all  (WTA)  neural  networks  using  pulse-firing  process- 
ing elements  are  introduced.  In  the  pulse-firing  WTA  (PWTA)  networks  de- 
scribed, input  and  activation  signal  shunting  is  controlled  by  one  shared  lat- 
eral inhibition  signal.  This  organization  yields  an  O(n)  area  complexity  that  is 
convenient  for  integrated  circuit  implementation.  Appropriately  specified  net- 
work parameters  allow  for  the  accurate  continuous  evaluation  of  inputs  using  a 
signal  representation  compatible  with  established  pulse-firing  neural  network 
implementations. 

1 Introduction 

The  winner-take-all  (WTA)  function  plays  a central  role  in  competitive  neural  networks 
and  is  related  to  recurrent  on-center  off-surround  models  of  natural  neural  systems  [1-3]. 
Although  it  can  be  realized  sequentially  via  pairwise  comparisons,  the  WTA  operation  is 
more  effectively  realized  in  parallel  analog  circuits  via  a distributed  network  of  processing 
elements  which  compare  relative  input  magnitudes  and  allow  only  that  element  with  the 
largest input (or "winner") to remain active. Parallel analog WTA realizations have been
described which use Hopfield Network dynamics [4], and MOS current conveyors [5,6].
The model introduced in this paper and its electronic implementation are more like a
WTA mechanism inspired by natural presynaptic inhibition feedback [7]. The new
pulse-firing WTA (PWTA) model employs a unique combination of a self-shunting feedback
term with output hysteresis to yield a WTA network compatible with asynchronous
pulse-firing neural network implementations described variously as impulse, pulse-stream, and
neural-type networks [8-10].

This  paper  first  introduces  asynchronous  pulse  firing  processing  units  in  Section  2. 
These  are  the  basic  computational  units  used  in  PWTA  networks.  The  mathematical 
foundation  of  PWTA  networks  is  then  presented  in  Section  3 where  the  system  dynamics 
of a general PWTA network are developed. Section 4 continues with the presentation of
MOS circuit implementations. Section 5 closes with an analysis of finite circuit precision
effects.


2 Asynchronous  Pulse  Firing  Processing  Units 

The  dynamics  of  the  pulse  firing  processing  units  used  in  a PWTA  network  obey  the 
following  equations: 
$$\kappa \dot{v} = -\alpha v + x - (\beta v + x - \gamma)\, g(v), \qquad v(t_0) = 0 \tag{1}$$

$$y = g(v)$$

where $v$ is unit activation, $x$ is total unit input, $y$ is unit output, and $g(\cdot)$ is the binary
hysteresis function shown in Figure 1. As can be deduced from the figure, $g(v)$ includes as a
special case the simple threshold nonlinearity (when $V_{tl} = V_{th}$), although the specific pulse
firing dynamics described here would cease to exist in that situation. Throughout this
paper input $x$ is assumed to be positive and time variant. The unit response to a constant
input is a train of regularly spaced constant width pulses. The larger the input signal
$x$, the greater the output pulse repetition rate. The parameter $\alpha$ establishes a first-order
response to $x$ during the input integration phase of operation as defined by the absence of
an output pulse ($y = 0$). That response is shifted to one defined by $\alpha + \beta$ during the firing
phase, as defined by the presence of an output pulse ($y = 1$). Since $x$ is shunted during the
output pulse period, processing element state asymptotically approaches $\epsilon = \gamma/(\alpha + \beta)$.
Parameter $\kappa$ uniformly scales all unit time constants. One pulse firing cycle is summarized
by the integration of the input signal until $v$ reaches $V_{th}$, whereupon the switches toggle,
causing the discharge of $v$ to $\min(\epsilon, V_{tl})$. Oscillation is sustained provided $\epsilon < V_{tl}$.
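
The firing cycle described above can be reproduced with a few lines of numerical integration. The sketch below is a minimal forward-Euler simulation of equation (1) for a constant input, not the author's simulator. The thresholds and rate parameters are borrowed from the 2-unit simulation reported later in the paper, except that gamma is set to 0 (an assumption made here so that the firing-phase asymptote lies below the lower threshold); the step size, run length and function names are also assumptions.

```python
def hysteresis(y_prev, v, v_tl, v_th):
    """Binary hysteresis g(v): switch on at v_th, off at v_tl, otherwise hold."""
    if y_prev == 0 and v >= v_th:
        return 1
    if y_prev == 1 and v <= v_tl:
        return 0
    return y_prev


def count_pulses(x, t_end=20.0, dt=1e-3, kappa=0.1, alpha=0.1,
                 beta=1.2, gamma=0.0, v_tl=1.0, v_th=4.0):
    """Forward-Euler integration of equation (1) for a constant input x."""
    v, y, pulses = 0.0, 0, 0
    for _ in range(int(t_end / dt)):
        dv = (-alpha * v + x - (beta * v + x - gamma) * y) / kappa
        v += dv * dt
        y_new = hysteresis(y, v, v_tl, v_th)
        if y_new == 1 and y == 0:
            pulses += 1  # rising edge of an output pulse
        y = y_new
    return pulses


for x in (0.5, 1.0):
    print(f"x = {x}: {count_pulses(x)} pulses in 20 time units")
```

With these values the larger input produces more pulses in the same interval, and an input whose integration-phase asymptote $x/\alpha$ stays below $V_{th}$ never fires at all.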


Figure 1: Output hysteresis function


3 PWTA  Network  Dynamics 

A PWTA  network  combines  the  unit  dynamics  described  in  (2)  with  lateral  inhibition.  Lat- 
eral inhibition  from  a combination  of  unit  outputs  can  be  expressed  in  a form  which  yields 
network  state  equations  similar  to  those  of  the  presynaptic  inhibition  model  described  by 
Yuille  and  Grzywacz  [7]: 


$$\kappa \dot{v}_i = -\alpha v_i + x_i F(V) - H_i(V) \tag{2}$$

where

$$F(V) = 1 - \bigvee_k g(v_k)$$

and

$$H_i(V) = \big((\lambda - \beta) v_i + \gamma - \xi\big)\, g(v_i) + (\beta v_i - \gamma) \bigvee_k g(v_k)$$

with $\bigvee$ indicating the logical OR operation and $V$ corresponding to the vector of unit
activations. $F(V)$ is a binary value establishing the input inhibition which occurs while
any processing element generates a pulse. It is during this "output firing phase" of the
network that $H_i(V)$ controls the processing elements such that a winning unit activation
decays at a different rate to a different equilibrium than that of a losing unit. The model
parameters  allow  for  the  independent  adjustment  of  winning  and  losing  unit  decay  rates 
and  asymptotes.  Properly  chosen  parameter  values  guarantee  WTA  function  independent 
of  initial  system  state  without  the  need  for  external  synchronization  [11]. 

All  units  contribute  to  a shared  lateral  inhibition  signal  identically.  The  winning  output 
is  indicated  by  the  dominance  of  the  first  unit  activation  to  reach  Vth:  since  it  establishes 
the  synchronized  re-initialization  of  all  units,  it  is  the  only  one  to  fire.  In  general,  the 
winning  unit  is  determined  by  a combination  of  initial  network  activation  state  and  input 
magnitude. With appropriately chosen parameters, however, the reset state establishes
initial conditions which make the winning unit decision independent of initial state and
dependent exclusively upon the $x_i$ inputs.

For  the  winning  unit  to  exactly  correspond  to  the  one  having  the  largest  input,  it  is 
important  that  initial  condition  independence  be  maintained.  To  guarantee  this  indepen- 
dence in  the  PWTA  network  described  by  (2)  parameters  are  chosen  such  that  all  units 
reset  to  an  identical  initial  condition. 

All activations in the PWTA of (2) will reset to near-identical initial states if parameters
are chosen such that $v_w^* < V_{tl}$, $v_l^* = V_{tl}$ and $\beta \gg \lambda$ [11], where $v_w^*$ and $v_l^*$ denote the
firing-phase asymptotes of the winning and losing units. During the output firing phase,
these values cause the losing units to approach $V_{tl}$ well before the winning unit does. When
the winning unit reaches $V_{tl}$, it terminates the output firing phase and all activations cease
to decay. Theoretically, the losing units only asymptotically converge to $V_{tl}$ while the
winning unit converges via a truncated exponential. Even though this is mathematically
imprecise, in a practical sense it can be assumed that $\beta$ is chosen large enough with respect
to $\lambda$ for losing units to converge to within the limits of finite precision hardware well before
the firing phase ends.

A geometric interpretation of ideal PWTA network operation with constant inputs is
illustrated in Figure 2. Each loop in the state diagram corresponds to one firing cycle.
Unit activations are reset to $V_{tl}$ at state $S_0$ in the diagram. The input integration phase (1
and 3 in the figure) begins at $S_0$ and terminates when the $V_{th}$ threshold is reached. That
is followed by the output firing phase (2 and 4 in the figure) during which unit activations
decay. In the figure, trajectory 1-2 corresponds to the path followed when unit $j$ wins,
and trajectory 3-4 to when unit $i$ wins. Which unit wins is determined by the one which
first reaches $V_{th}$. That in turn is determined by the state trajectory during the input
integration phase. With constant inputs, that trajectory is linear with slope proportional
to the quotient of the input signal magnitudes. A "winning boundary" for constant inputs
can be identified by unit slope (dashed line in the figure). This geometric interpretation
of PWTA operation will prove useful in a later discussion of finite precision effects.

Figure 2: Ideal PWTA network operation

Figure 3 illustrates the operation of a 2-unit PWTA network in response to a smooth
transition between two inputs. In this example, input signals $x_1$ and $x_2$ move from 0 to 1
and from 1 to 0 respectively, crossing at $t = 10$. The parameters chosen for this simulation
are $V_{tl} = 1$, $V_{th} = 4$, $\kappa = 0.1$, $\alpha = 0.1$, $\beta = 1.2$, $\gamma = 1.3$, and $\lambda = 0.6$. The activation state
space diagram for this simulation is shown in Figure 4. It can be seen how the reset state
with these parameters assures input order preservation.
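
A rough re-creation of this experiment is sketched below using forward-Euler integration of the network state equations (2). It is an illustration under stated assumptions rather than the simulation used for Figures 3 and 4: the paper does not give the input waveform, so a logistic transition centered at t = 10 is assumed; no value is listed for the winner offset parameter, so `xi = 0` is assumed; the step size and function name are likewise hypothetical.

```python
import math


def simulate_pwta(t_end=20.0, dt=1e-3, kappa=0.1, alpha=0.1, beta=1.2,
                  gamma=1.3, lam=0.6, xi=0.0, v_tl=1.0, v_th=4.0):
    """Forward-Euler run of a 2-unit PWTA network; returns pulse start times per unit."""
    v = [0.0, 0.0]      # unit activations
    g = [0, 0]          # hysteresis outputs g(v_i)
    fired = [[], []]    # pulse start times for unit 1 and unit 2
    for step in range(int(t_end / dt)):
        t = step * dt
        x1 = 1.0 / (1.0 + math.exp(-(t - 10.0)))  # assumed input shape
        x = [x1, 1.0 - x1]                        # the two inputs cross at t = 10
        any_g = 1 if any(g) else 0                # shared lateral inhibition (logical OR)
        for i in range(2):
            h = ((lam - beta) * v[i] + gamma - xi) * g[i] + (beta * v[i] - gamma) * any_g
            v[i] += dt * (-alpha * v[i] + x[i] * (1 - any_g) - h) / kappa
        for i in range(2):
            if g[i] == 0 and v[i] >= v_th:
                g[i] = 1
                fired[i].append(t)
            elif g[i] == 1 and v[i] <= v_tl:
                g[i] = 0
    return fired


fired = simulate_pwta()
print("unit 1 pulses:", len(fired[0]), "unit 2 pulses:", len(fired[1]))
```

In this sketch only the unit with the larger input fires well away from the crossing point: unit 2 early in the run and unit 1 late in the run.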


4 CMOS  Circuit  Implementation 

By way of introduction to the CMOS PWTA network, a CMOS implementation of a
pulse firing processing element shall first be considered. Figure 5 shows the circuit for
an impulse neural circuit as described previously in [8]. For simplicity, $\gamma = 0$, and $\alpha$ is
for practical purposes nonexistent by virtue of the low leakage currents exhibited in MOS
technology. $C_\kappa$ includes not only the ideal capacitance of a poly-1 capacitor, but also
stray wiring capacitance and the input capacitance exhibited by the Schmitt trigger G.
The Schmitt trigger provides high voltage gain at the threshold voltages $V_{tl}$ and $V_{th}$, with
positive feedback from the output establishing the active threshold. The Schmitt trigger
output can be expressed in terms of the hysteresis function $g$ of (2) as $G(v) = V_{DD}\, g(v)$. $\beta$
corresponds to the channel conductance of M2 which operates in the active region when
an output pulse is generated. Further details regarding the operation of this circuit are
provided in [8].

3rd NASA Symposium on VLSI Design 1991

Figure 3: Inputs, state variables and output of a 2-unit PWTA network

Figure 4: State trajectory of a 2-unit PWTA network simulation

Figure 5: CMOS implementation of a pulse firing processing unit

A CMOS  implementation  of  a PWTA  cell  is  shown  in  Figure  6.  The  basic  elements  of 
the  CMOS  impulse  circuit  are  augmented  by  additional  MOSFETs  which  establish  various 
parameters  associated  with  the  ideal  model.  Variations  of  this  circuit  having  reduced 
transistor  counts  are  also  possible  [11].  The  circuit  of  Figure  6 simply  represents  the  most 
general  CMOS  PWTA  implementation  consistent  with  the  Equation  (3)  definition. 

Two local signals and one global signal control circuit operation in accordance with
current network state. The local signals $G_i$ and $\overline{G_i}$ indicate that unit $i$ is the winning unit
when $G_i = V_{DD}$ and $\overline{G_i} = 0$ V. These signals select the local firing response. Both the
true and complemented global lateral inhibition signals $F$ and $\overline{F}$ are derived by a
pseudo-NMOS NOR gate and a CMOS inverter consisting of transistors M11 through M14 in the
diagram. These signals are distributed on two wires between all cells of the WTA network.
NOR pulldown transistors (M11) are distributed across all cells while there need only exist
a single pullup transistor (M12) and single inverter (M13, M14). When any unit in the
network initiates a pulse, $F$ becomes active, causing all units to enter the output firing
phase.

Transistor M1 disconnects input current $x_i$ during the output firing phase, causing it
to be shunted into the parallel capacitance of some input circuit (not shown; see [8] for
further details). Also during the firing phase, transistors M2, M7, M6, and M10 conduct,
allowing some combination of the currents $I_1$ through $I_4$ to flow.

The circuit branch consisting of M2 through M4 establishes a current which corresponds
to the constant $I_1 = \gamma$. Similarly, branch M7 through M9 establishes a constant current
corresponding to $I_3 = \xi$. Ignoring the nonlinear component of channel conductance, the
branch consisting of transistor M10 establishes a current corresponding to $I_4 = \lambda v_i$ and
the M5, M6 branch a current analogous to $(\beta - \lambda) v_i$. As with the circuit of Figure 5, $\alpha$ is
assumed to be negligible.

During the output phase, the signals $G$ and $\overline{G}$ control the unit response. If
the unit is a winner, then $G = V_{DD}$, $\overline{G} = 0$ V, and branch currents $I_3$ and $I_4$ are allowed
to flow. This establishes a winning unit response where the unit activation asymptotically
decays to $\xi/\lambda$, but is truncated at $V_{tl}$ when the winning unit terminates the output firing
phase. If the unit is a loser, then $G = 0$ V, $\overline{G} = V_{DD}$, and branch currents $I_1$, $I_2$, and
$I_3$ are allowed to flow. This establishes a losing unit response, where the unit activation
asymptotically approaches $V_{tl}$ at a much faster rate than it would otherwise as the winning
unit.


5 Finite  Precision  Effects 

Thus  far,  the  effects  of  finite  parameter  precision  have  been  ignored.  Intra-cell  parameter 
variation  will  contribute  to  deviations  from  the  ideal  performance  previously  described. 
This  section  focuses  upon  the  parametric  variations  which  will  have  the  greatest  effect 
upon  PWTA  performance  that  also  are  the  most  likely  to  occur  in  contemporary  CMOS 
fabrication  processes. 

The  overall  function  of  a PWTA  network  is  to  select  the  input  signal  having  the  greatest 
magnitude.  Inspection  of  Figure  2 reveals  that  there  are  two  potential  error  sources  which 
can  interfere  with  that  function.  These  are  errors  in  the  determination  of  the  initial 
network state, $S_0$, and deviations in the position of the winning boundary. These variations
effectively give an unfair advantage to some processing units, sometimes allowing units to
fire even though their inputs are not necessarily the largest. Fortunately, it can be shown
that this occurs only when two inputs have very nearly the same magnitude. Units having
input signals which are "clearly" not the largest will remain quiescent. The definition of
"clearly" is expressed as a hysteresis deadband which naturally occurs around the winning
boundary. This hysteresis arises directly from parametric variation.

For the remainder of this section only constant inputs will be considered. This allows
for the analysis of parameter precision effects while the network is in a steady-state
operating condition. Figure 2 illustrates network operation under ideal conditions when the
critical parameters $V_{tl}$, $V_{th}$, $\kappa$, $\gamma$ and $\beta$ are assumed identical across all units. Under these
conditions, unit $i$ wins if

$$\frac{dv_i}{dv_j} > 1 \tag{14}$$

with unit $j$ winning otherwise. $dv_i/dv_j = 1$ corresponds to the ideal winning boundary
(dashed line) of Figure 2.

Variations in the scaling constant $\kappa$ yield an inaccurate winning boundary definition.
$\kappa$ is determined in the CMOS implementation by the MOS capacitor $C_\kappa$ in the previous
circuit diagram. Variations in capacitor geometry will lead to inter-unit $\kappa$ variation and
subsequently give those units having a smaller $\kappa$ an advantage in the race toward $V_{th}$.
Recognizing  that 
yields the decision rule that unit $i$ wins if

$$\frac{dv_i}{dv_j} > \frac{\kappa_j}{\kappa_i} \tag{16}$$

This rule reduces to the ideal one of (14) when $\kappa_i = \kappa_j$. Although all units are initialized
to the same state at $S_0$, the decision boundary is shifted such that one unit is favored over
another.
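
The advantage that a smaller scaling constant gives in the race toward the firing threshold can be illustrated numerically. The sketch below assumes constant inputs and a negligible $\alpha$, so that during the integration phase each activation rises linearly from the common reset state; the mismatch values, the reset and threshold levels, and the function name are hypothetical and chosen only for illustration.

```python
def time_to_threshold(x, kappa, v_th=4.0, v_0=1.0):
    """Time for kappa * dv/dt = x to carry v from v_0 up to v_th (alpha neglected)."""
    return kappa * (v_th - v_0) / x


# Hypothetical mismatch: unit j has the larger input, unit i the smaller capacitor.
t_i = time_to_threshold(x=0.97, kappa=0.095)
t_j = time_to_threshold(x=1.00, kappa=0.100)
winner = "i" if t_i < t_j else "j"
print(f"t_i = {t_i:.4f}, t_j = {t_j:.4f}, winner = {winner}")
```

Here a 5% smaller scaling constant outweighs a 3% smaller input, so the unit with the smaller input reaches the threshold first. This is the kind of close decision that finite precision can reverse.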

Variation of $C_\kappa$ alone leads to finite precision for the winning boundary. Parameter
variations which affect the initial network state and the unit firing thresholds lead to more
complex hysteresis effects at the winning boundary. Firing thresholds, as defined by $V_{th}$
in the Schmitt trigger, are typically determined by device geometries. The initial network
state is determined in part by $V_{tl}$, which is also dependent upon Schmitt device geometries.
The other part of initial network state is determined by the geometries of M2-M6 and
M10 of Figure 6 (corresponding to $\beta$, $\gamma$ and $\lambda$ in the ideal equations). Not only do such
variations change the slope of the winning boundary, but the slope is also dependent upon
the winning unit as well. Under these conditions, the current winner is favored by the initial
states such that a new winner must have a significantly larger input than the present one.
These effects can be used to extend the decision rule expressed by (16) to one where unit
$i$ wins if

$$\frac{dv_i}{dv_j} > \frac{\kappa_j}{\kappa_i}\, \frac{V_{thi} - V_{0i}}{V_{thj} - V_{0j}} \tag{17}$$

where $V_{0i}$ and $V_{0j}$ are determined by the combined variations of M2-M6 and M10
between units $i$ and $j$. It can be easily verified that this decision rule reduces to the ideal
case when there is no inter-unit variation. The effect this has on overall PWTA function
is to introduce a hysteresis deadband which only affects close decisions.

6 Conclusion 

An ideal linear model has been used to establish a general basis for PWTA function.
This model improves an earlier one based upon presynaptic inhibition in two ways. The
new model uses O(n) interconnect for lateral inhibition and does not require an external
reset signal since it is fully asynchronous. Furthermore, it provides information regarding
how strong a winning input is, a feature not always found in winner-take-all networks.
Model parameters can be chosen to guarantee ideal winner-take-all function given precise
parameter specifications. The model is also fully compatible with previously established
asynchronous pulse firing analog neural ICs.

CMOS PWTA circuits have also been presented. These circuits necessarily deviate
from the ideal linear model, but they can be designed to exhibit similar behavior simply
by accounting for the nonlinear characteristics of the electronic devices they employ.
Non-ideal effects arising from practical implementation considerations have also been addressed.
Parametric variation between pulse-firing processing units leads to a finite decision
accuracy and the potential existence of a hysteresis deadband.

Figure 6: A general CMOS PWTA cell
Abstract- This paper discusses the training of product neural networks using genetic
algorithms. Two unusual neural network techniques are combined: product
units are employed instead of the traditional summing units and genetic
algorithms train the network rather than backpropagation. As an example,
a neural network is trained to calculate the optimum width of transistors in
a CMOS switch. It is shown how local minima affect the performance of a
genetic algorithm, and one method of overcoming this is presented.


1 Introduction 

Neural networks have been applied successfully to many problems in recent years. Traditionally
these networks are composed of multiple layers of summation units. These simple
units sum their inputs, each input multiplied by a variable weight. This summation is
usually then squashed by a non-linear function such as the logistic function. Several researchers
have shown that networks composed of these units can calculate any function to
any arbitrary degree of accuracy given enough summation units [1]. However, there are
many functions that are complicated enough that the number of summation units it takes
to duplicate them is prohibitive. One very commonly found task is that of higher order
combinations of the inputs such as either X * X or X * Y.

One proposed solution is a new unit called the “sigma-pi unit” [3]. This unit not only
applies a weight to each input, but also applies a weight to the second and possibly higher
order products of the inputs. While this unit is much more powerful than the traditional
summation unit, the number of weights increases very rapidly with the number of inputs, and soon
becomes unmanageable when applied to solving large problems. Since most problems only
need one, or at most a few, of these terms, the sigma-pi unit is overkill.

1.1  Product  Units 

Figure 1: Recommended product network configurations [4]

A suitable alternative was introduced by Durbin and Rumelhart [2]. The “product unit”
computes the product of its inputs, each raised to a variable power. This is shown in
Equation 1,

    y = \prod_{i=1}^{N} x_i^{p(i)}    (1)

where x_i is the i-th of the N inputs and y is the output of the unit.

The p(i) term is treated in the same way as the variable weights for summation units.
Using  the  modified  version  of  backpropagation  presented  by  Durbin  and  Rumelhart,  these 
product  units  can  provide  much  more  generality  than  sigma-pi  units.  While  a sigma-pi 
unit  is  constrained  to  using  just  polynomial  terms,  the  product  units  can  use  fractional 
and even negative terms. As Durbin and Rumelhart point out, product units can actually
be  considered  a superset  of  sigma-pi  units;  for  if  several  of  the  product  units  are  used,  and 
they  are  constrained  to  only  integer  values,  they  would  have  the  same  results. 

There  are  many  ways  that  product  units  can  be  used  in  a network.  However,  the 
overhead  required  to  raise  an  arbitrary  base  to  an  arbitrary  power  makes  it  unlikely  that 
they  will  replace  summation  units.  Durbin  and  Rumelhart  propose  that  the  primary  use 
of the product units will be to supplement the power of the summation units. Two proposed
architectures are shown in Figure 1. The term product neural networks (or product
networks)  will  be  used  to  refer  to  networks  containing  both  product  and  summation  units. 

While product units increase the capability of a neural network, they also add complications.
Not only is backpropagation harder to accomplish, but the solution space becomes
more  convoluted.  As  Durbin  and  Rumelhart  pointed  out,  there  are  often  local  minima  that 
trap  the  network.  As  a possible  solution  to  this  problem,  this  paper  investigates  the  use 
of  genetic  algorithms  to  train  product  networks. 

2 Genetic  Algorithms 

2.1  Introduction 

A genetic  algorithm  (GA)  is  an  exploratory  procedure  that  is  able  to  locate  near-optimal 
solutions  to  complex  problems.  To  do  this,  it  maintains  a set  (called  a population)  of  trial 
solutions  (called  chromosomes).  Through  a repeated  four-step  process,  these  chromosomes 
evolve until an acceptable solution is found. These steps are evaluation, reproduction,
breeding,  and  mutation.  A representation  for  possible  solutions  must  first  be  developed. 
Then,  with  an  initial  random  population,  the  GA  is  able  to  solve  the  problem  almost 
without  regard  to  the  interpretation  of  the  chromosome.  Each  generation,  the  chromosomes 
produced,  through  survival-of-the-fittest  and  exploitation  of  old  knowledge  in  the  gene 
pool,  should  have  an  improved  ability  to  solve  the  problem. 

There  were  two  primary  reasons  why  GAs  were  applied  to  training  product  networks. 
First  was  that  the  addition  of  product  nodes  made  the  solution  space  more  complicated. 
Because backpropagation is a gradient descent algorithm, it is very likely to get caught
in  a local  minimum.  The  second  reason  was  that  backpropagation  tends  to  be  slow  with 
complicated  problems.  It  often  takes  many  iterations  to  train  a complicated  network.  It  is 
hoped  that  the  use  of  a GA  will  find  the  best  answer  much  faster  than  backpropagation. 

2.2  Representation 

Before  applying  a genetic  algorithm  to  any  task,  a representation  for  possible  solutions 
must  be  found.  The  most  common  method  for  representing  these  possible  solutions  is  with 
a bit  string.  Higher  order  strings  (such  as  character  strings)  or  trees  (such  as  binary  trees) 
have  also  been  used.  Since  the  architecture  of  the  product  networks  to  be  trained  will 
be  known,  a binary  string  representation  with  a fixed  number  of  bits  per  weight  can  be 
constructed.  Thus,  each  weight  in  the  network  has  a certain  number  of  bits  associated  with 
it.  This  representation  permits  each  chromosome  to  be  decoded  easily,  while  still  allowing 
each weight a large degree of freedom. A typical population had between 30 and 100
members, with 16 bits representing each weight.
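
The fixed-width encoding can be sketched as follows. Only the 16 bits per weight come from the text; the weight range and the linear scaling from integer field to real weight are assumptions made for illustration, since the paper does not state how a field is mapped to a weight value.

```python
BITS_PER_WEIGHT = 16          # from the text
W_MIN, W_MAX = -4.0, 4.0      # assumed weight range, not given in the paper

def decode(chromosome, n_weights):
    """Split a list of 0/1 bits into fixed-width fields and scale each to a weight."""
    assert len(chromosome) == n_weights * BITS_PER_WEIGHT
    full_scale = 2 ** BITS_PER_WEIGHT - 1
    weights = []
    for k in range(n_weights):
        field = chromosome[k * BITS_PER_WEIGHT:(k + 1) * BITS_PER_WEIGHT]
        value = int("".join(str(b) for b in field), 2)
        # Linear map from [0, full_scale] to [W_MIN, W_MAX] (an assumption).
        weights.append(W_MIN + (W_MAX - W_MIN) * value / full_scale)
    return weights

example = [0] * 16 + [1] * 16   # two weights: all zeros, then all ones
decoded = decode(example, 2)
```

Because every weight occupies a known slice of the string, decoding needs no knowledge of the network beyond the number of weights.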

2.3  Evaluation 

The  first  step  in  any  generation  is  the  evaluation  of  the  current  chromosomes.  This  is  the 
only  step  where  the  interpretation  of  the  chromosome  is  used.  Each  chromosome  in  the 
population  is  decoded,  and  the  result  is  used  to  solve  the  original  problem.  This  solution 
is  then  graded  on  how  well  it  solved  the  problem.  The  method  used  to  grade  product 
networks  is  to  calculate  the  sum  of  squared  error  (SSE)  for  the  training  set.  The  fitness  of 
the  chromosome  is  equal  to  1/(1  + SSE).  This  means  that  the  better  a network  performs, 
the  higher  its  fitness,  with  a perfect  network  having  a fitness  of  1. 
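
The fitness measure 1/(1 + SSE) is simple enough to state directly in code. This is a sketch only; the function name and the sample outputs are illustrative and not taken from the paper.

```python
def fitness(outputs, targets):
    """Fitness = 1 / (1 + SSE), where SSE is the sum of squared errors."""
    sse = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    return 1.0 / (1.0 + sse)

perfect = fitness([2.0, 3.0], [2.0, 3.0])      # SSE = 0
off_by_one = fitness([3.0, 3.0], [2.0, 3.0])   # SSE = 1
```

The added 1 in the denominator keeps the fitness finite for a perfect network and bounds it to the interval (0, 1].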

2.4  Reproduction 

The  next  step  in  a generation  is  to  create  a new  population  based  upon  the  evaluation 
of  the  previous  one.  Every  chromosome  generates  a specific  number  of  copies  of  itself, 
based  on  how  well  it  solved  the  problem.  Thus  the  chromosomes  that  performed  better 
will  produce  several  copies  of  themselves,  while  the  worst  chromosomes  won’t  produce 
any  copies.  This  is  the  step  that  allows  GAs  to  take  advantage  of  a survival-of-the-fittest 
strategy. 

There are several methods to calculate the number of offspring that each chromosome
will  have.  One  of  the  more  prevalent  methods  is  called  ratioing.  With  ratioing,  each 
chromosome  produces  a number  of  offspring  directly  related  to  its  fitness,  with  the  only 
restriction  being  that  the  total  number  of  chromosomes  per  generation  remains  constant. 
Thus, if one chromosome has a fitness that is twice that of another, then the superior chromosome
would produce twice as many offspring. However, there are two major problems
with this method. First, if all the chromosomes have similar fitness, each member in the
population will produce one offspring. This results in little pressure toward improving the
solution. The second problem, although from a different source, has the same effect. If any
one chromosome should happen to have a fitness much larger than any of the others, then
that chromosome would create most, if not all, of the new offspring. This discriminates
against the remaining information of the gene pool in favor of this super-chromosome, losing
the information in the gene pool. This particular type of stagnation has been labeled
premature convergence.

The  method  the  author  used  to  train  the  product  networks  is  ranking  [5].  In  ranking, 
the  whole  population  is  sorted  by  fitness.  The  number  of  offspring  each  chromosome  will 
generate  is  then  determined  by  where  it  falls  in  the  population.  The  ranking  algorithm 
used was that the top 30% of the population generated two offspring each, the bottom
30% of the population generated no offspring, and the rest of the population each generated
one offspring. In this way, no one chromosome can overpower the population in a
single  generation,  and  no  matter  how  close  the  actual  fitness  values  are,  there  is  always 
constant  pressure  to  improve.  While  the  problem  of  premature  convergence  still  exists,  it 
is  greatly  reduced  by  allowing  other  chromosomes  a chance  to  mix  information  with  high 
fitness  chromosomes.  The  disadvantage  of  using  ranking  is  speed.  In  not  allowing  better 
chromosomes  to  guide  the  population  easily,  good  answers  are  slower  to  develop. 
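
The 30/40/30 ranking rule can be sketched as below. The function name and the way the group size is rounded are assumptions; the paper gives only the percentages.

```python
def rank_reproduce(population, fitnesses):
    """Top 30% get two copies, bottom 30% none, the rest one copy each."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i], reverse=True)
    n_edge = round(0.3 * len(population))  # size of the top and bottom groups
    offspring = []
    for rank, idx in enumerate(order):
        if rank < n_edge:
            copies = 2
        elif rank >= len(population) - n_edge:
            copies = 0
        else:
            copies = 1
        offspring.extend([population[idx]] * copies)
    return offspring

pop = list("ABCDEFGHIJ")
fit = [0.1, 0.9, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 1.0]
new_pop = rank_reproduce(pop, fit)
```

Because the top and bottom groups are the same size, the population size is unchanged, and the number of copies depends only on rank, never on how large the fitness gaps are.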

2.5  Breeding 

The  previous  step,  reproduction,  created  a population  whose  members  currently  best  solve 
the problem. However, many of the chromosomes are identical and none are different from
those  in  the  previous  generation.  Breeding  combines  chromosomes  from  the  population 
and  produces  new  chromosomes  that,  while  they  did  not  exist  in  the  previous  generation, 
maintain  the  same  gene  pool.  In  natural  evolution,  breeding  and  reproduction  are  the 
same  step,  but  in  GAs  they  have  been  separated  to  allow  different  methods  for  each  to  be 
experimented  with  and  independently  evaluated.  It  is  in  this  step  where  GAs  can  exploit 
knowledge  of  the  gene  pool  by  allowing  good  chromosomes  to  combine  with  chromosomes 
that  aren’t  as  good.  This  is  based  on  the  assumption  that  each  individual,  no  matter  how 
good  it  is,  doesn’t  contain  the  answer  to  the  problem.  The  answer  is  contained  in  the 
population  as  a whole,  and  only  by  combining  chromosomes  will  the  correct  answer  be 
found. 

There are many methods used for breeding, the most common being crossover.
Crossover typically takes two chromosomes and swaps parts of each to create two new
chromosomes. Many variations on crossover have been used, but no results have shown
which is decisively better. The crossover the author used to train the product networks was
a simple two-point crossover. Two random points are chosen in the chromosome, and the
bit string between the two points is swapped between the two chromosomes. An example
is shown in Figure 2.

Figure 2: Example of two-point crossover (crossover points indicated by arrows)
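
A two-point crossover on bit lists can be sketched as follows. The function name and the optional `rng` parameter are illustrative assumptions; the seed is fixed only so the demonstration is repeatable.

```python
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    """Swap the segment between two random cut points; returns two new lists."""
    assert len(parent_a) == len(parent_b)
    # Two distinct cut positions, sorted so that i < j.
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

rng = random.Random(0)  # fixed seed only to make the demo repeatable
a, b = [0] * 12, [1] * 12
child_a, child_b = two_point_crossover(a, b, rng)
```

Every bit of the two parents survives in one child or the other, which is why breeding rearranges the gene pool without adding to it.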


2.6  Mutation 

The  last  step  in  creating  a new  generation  is  based  on  the  assumption  that  while  each 
generation  is  better  than  the  previous,  the  individuals  that  die  may  have  some  information 
that  is  essential  to  the  solution.  It  is  also  possible  that  the  initial  population  didn’t  have 
all  the  necessary  information.  The  reinjection  of  information  into  the  population  is  called 
mutation.  Again,  there  are  many  ways  to  implement  mutation,  but  essentially  all  choose 
and  change  members  of  the  population  randomly. 

The  method  the  author  used  was  to  simply  inject  a constant  number  of  mutations  every 
generation.  The  number  of  mutations  used  was  approximately  0.25%  of  the  total  number 
of bits in the entire population. These mutations were then randomly distributed among
all  the  bits,  with  each  bit  having  the  same  chance  of  mutating.  A mutation  involved  a 
50/50  chance  of  setting  the  bit  to  a 1 or  0,  in  effect  giving  the  mutated  bit  a 50/50  chance 
of  changing.  This  means  that  any  specific  chromosome  may  or  may  not  mutate,  with  a 
small  chance  that  it  could  severely  mutate. 
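
The mutation step might look like the sketch below. The 0.25% rate and the random 0-or-1 assignment come from the text; drawing bit positions with replacement, and the population dimensions in the demo, are assumptions.

```python
import random

MUTATION_FRACTION = 0.0025  # about 0.25% of all bits, from the text

def mutate(population, rng=random):
    """Return a mutated copy; each mutation sets a random bit to a random 0 or 1."""
    mutated = [list(chrom) for chrom in population]
    bits_per_chrom = len(mutated[0])
    total_bits = len(mutated) * bits_per_chrom
    n_mutations = round(MUTATION_FRACTION * total_bits)
    for _ in range(n_mutations):
        pos = rng.randrange(total_bits)          # every bit equally likely
        chrom, bit = divmod(pos, bits_per_chrom)
        mutated[chrom][bit] = rng.randint(0, 1)  # may leave the bit unchanged
    return mutated

rng = random.Random(1)  # fixed seed only to make the demo repeatable
population = [[0] * 160 for _ in range(50)]   # 50 chromosomes, 10 weights x 16 bits
mutated = mutate(population, rng)
changed = sum(sum(chrom) for chrom in mutated)
```

With 8000 bits this injects 20 mutations per generation, of which roughly half actually flip a bit.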

2.7  An  Application 

A product  network  was  trained  that  calculates  the  optimum  width  of  the  transistors  in 
a CMOS  switch  given  temperature,  power  supply  voltage,  and  minimum  conductance  as 
inputs.  While  there  are  many  excellent  analysis  tools  available,  such  as  circuit  simulators, 
there  are  almost  no  software  packages  available  that  transform  performance  specifications 
into  a circuit  schematic.  This  network  is  designed  as  an  aid  to  CMOS  circuit  designers, 
and  was  first  proposed  by  Thelen  in  [4]. 

The  data  used  to  train  the  network  was  extracted  from  several  SPICE  simulations  with 
differing  transistor  dimensions,  temperatures,  and  power  supply  voltages.  In  the  training 
set  created  from  this  data,  the  voltages  ranged  from  3 to  12  volts,  the  temperature  from 
303  to  403  °K,  and  the  transistor  width  from  2 to  20  micrometers.  Using  these  inputs, 
the  conductance  could  range  from  approximately  1 to  500  micro-mhos.  Two  hundred  data 
points  were  collected  and  a sample  from  these  points  is  shown  in  Table  1. 

Voltage (V)   Temperature (°K)   Conductance (mhos)   Desired Width (µm)
3             303                1.026E-6             2
3             303                3.806E-6             3
3             303                6.593E-6             4
3             303                1.204E-5             6
3             303                1.752E-5             8
3             303                2.851E-5             12
3             303                3.951E-5             16
3             303                6.152E-5             24

Table 1: Sample from the data points used to train the network

Figure 3: The product neural network trained to select the width of a CMOS switch


The configuration of the product network was designed by Thelen using a priori information
about the equations that model a CMOS switch. The layout of the network is shown
in Figure 3.


3 Results 

The  first  attempts  at  training  the  product  network  had  very  consistent,  but  wrong,  results. 
Through  many  runs  of  the  GA,  every  solution  represented  a network  that  gave  outputs  of 
approximately  ten  for  the  transistor  width,  with  no  regard  for  the  input. 

The first success came when the population was seeded with an approximation to the
solution. This approximation was derived by a curve-fitting program using the training
data. When seeded, the GA was able to quickly improve the approximation and find a
network that gave the desired output. While seeding did indeed allow an answer to be
found, it was desired that the GA could find an answer using an initial random population.

Better success was found using a penalty function. Penalty functions decrease the
fitness of a chromosome by adding constraints to the solution. The penalty subtracted
from the fitness of a chromosome depended upon how close the outputs for two consecutive
data points were. The closer the two outputs for the two points were, the larger the
penalty. With the addition of this penalty function, the GA was able to find a solution for
the network given an initial random population.
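
The paper does not give the form of the penalty, so the following is purely a hypothetical sketch of one function with the stated property (closer consecutive outputs give a larger penalty). The function names, the `scale` and `tol` parameters, and the 1/(1 + d) shape are all assumptions.

```python
def flatness_penalty(outputs, scale=0.01, tol=1.0):
    """Larger when consecutive outputs are closer together (hypothetical form)."""
    pairs = zip(outputs, outputs[1:])
    closeness = [1.0 / (1.0 + abs(a - b) / tol) for a, b in pairs]
    return scale * sum(closeness) / max(len(closeness), 1)

def penalized_fitness(raw_fitness, outputs):
    """Subtract the penalty from the raw 1 / (1 + SSE) fitness."""
    return raw_fitness - flatness_penalty(outputs)

flat = flatness_penalty([10.0, 10.0, 10.0])   # constant output: maximum penalty
varied = flatness_penalty([2.0, 6.0, 12.0])   # output follows the input: small penalty
```

A network that answers "about ten" for every input receives the largest penalty under such a function, which is the behavior the first training attempts needed to be steered away from.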


4 Discussion 

The  initial  results  were  very  surprising.  The  inability  of  the  GA  to  find  an  appropriate 
solution  meant  that  either  the  network  could  not  solve  the  problem,  or  that  the  real 
solution to the problem was extremely difficult to find. Previous work by Thelen showed
that  indeed  a solution  to  this  problem  did  exist.  This  meant  that  the  real  solution  must 
be  difficult  for  the  GA  to  find.  In  fact,  when  seeding  the  GA  with  approximate  solutions, 
an  answer  was  found. 

There  are  three  ways  to  make  a problem  difficult  for  a GA  to  solve.  Either  the  solution 
space  is  extremely  convoluted,  the  best  solution  occupies  a very  small  portion  of  the 
solution space, or the solution space is misleading to a GA. Since proving whether a GA
is being misled is very difficult, the other two possibilities were considered. Comparing
the  solutions  found  by  the  GA  showed  that  they  converged  to  the  same  answer  each  time. 
Thus,  the  solution  space  was  not  too  convoluted  for  the  GA  to  search. 

The  correct  solution  was  not  found  with  an  initial  random  population.  However,  it  was 
found  with  the  insertion  of  the  penalty  function.  (The  effect  of  the  penalty  function  was 
to  place  a pole  in  the  middle  of  the  unwanted  solution,  thus  allowing  the  GA  to  continue 
searching  the  space,  and  find  the  correct  solution.)  This  leads  the  authors  to  believe  that 
the  right  answer  occupied  a very  small  portion  of  the  solution  space,  allowing  the  GA  to 
more  easily  find  the  undesired  answer. 

This example points out one common problem with GAs. In using GAs, often the solution
space is not very well known, and suboptimal answers can often dominate the solution
space.  Indeed,  if  the  problem  to  be  solved  is  incorrectly  or  incompletely  represented,  the 
GA  will  take  advantage  of  these  mistakes,  and  produce  wrong  answers. 


5 Conclusion 

It has been shown that product networks can be successfully trained with genetic algorithms.
A product network has been trained to give the width of a CMOS switch, given
power supply voltage, temperature and minimum conductance specifications for the switch.
Further research will be done to compare the use of GAs to backpropagation in product
networks. Also, the capabilities of product networks will be compared to traditional neural
networks. While product units have been shown to have superior capabilities over
traditional summation units, almost no studies comparing different networks have been
done.

Abstract- Neural networks tend to fall into two general categories: 1) software
simulations, or 2) custom hardware that must be trained. The scope of this
project is the merger of these two classifications into a system whereby a
software model of a network is trained to perform a specific task and the
results are used to synthesize a standard cell realization of the network using
automated tools.


1 Introduction 

Neural net research may be roughly classified into two general categories: software
simulations or programmable neural hardware [2,6].

Many  neural  network  simulators  are  readily  available.  The  major  drawback  to  all  of 
them  is  that,  no  matter  how  well  written,  they  are  run  on  a sequential  machine.  This  means 
that  the  software  must  simulate  the  parallelism  of  the  network  and  slows  down  dramatically 
as  the  number  of  connections  increases  [1,3]. 

Hardware  neural  networks  are  usually  general  purpose  and  must  be  trained. 
Depending  on  the  training,  a significant  percentage  of  the  total  hardware  resources  may  be 
unused.  By  defining  the  network  with  a software  model  and  then  synthesizing  the  network 
from  that  model,  all  of  the  silicon  area  will  be  utilized.  This  should  result  in  a significant 
reduction  in  die  size  when  comparing  the  application  specific  version  to  a general  purpose 
neural  network  capable  of  being  trained  to  perform  the  same  task. 

2 Network Modeling


The  simulator  that  is  being  used  for  this  project  is  version  2.01  of  NETS  written  by  Paul  T. 
Baffes  of  the  Software  Technology  Branch  of  the  Lyndon  B.  Johnson  Space  Center  [3].  This 
simulator  was  chosen  for  several  reasons.  NETS  has  a flexible  network  description  format, 
the  source  code  is  available,  and  the  weight  matrix  may  be  stored  in  an  ASCII  file  for  easy 
use  in  later  steps. 

As  a first  design  effort,  a simple  numeral  recognition  network  with  three  layers  and 
37  neurons  was  defined.  The  network  consists  of  a 5 by  6 input  layer,  one  hidden  layer  that 
is  also  5 by  6,  and  a 1 by  7 output  layer.  This  network  is  fully  connected.  Figure  1 contains 
the  NETS  description  of  the  network. 


LAYER : 0 -INPUT LAYER
NODES : 30
X-DIMENSION : 5
Y-DIMENSION : 6
TARGET : 2

LAYER : 1 -OUTPUT LAYER
NODES : 7
X-DIMENSION : 1
Y-DIMENSION : 7

LAYER : 2 -FIRST HIDDEN LAYER
NODES : 30
X-DIMENSION : 5
Y-DIMENSION : 6
TARGET : 1

NETS description of neural network.

Figure 1.


A training set consisting of ten digits (0 through 9) and the corresponding ASCII values is
used to build the network weighting matrix. Figure 2 illustrates a typical character
representation  and  its  corresponding  input/output  vector.  The  network  training  set  does  not 
include  any  noisy  or  corrupted  data  to  simplify  the  model.  Training  the  network  required  100 
iterations  and  was  completed  in  about  6 minutes.  The  fully  connected  network  has  1110 
connections.  The  weights  in  the  weight  matrix  range  between  ±1.7  following  training. 



Once the network is trained, the number of connections is reduced. This is done by setting
all weights having an absolute value less than a specified value to zero (no connect). This
process is easily automated, allowing various cut-off values to be evaluated. The modified
weight matrix is evaluated using NETS to determine whether or not the network will still
satisfactorily perform its designed task. Table 1 summarizes the results of reducing the network.
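
The pruning step can be sketched in a few lines of Python. The weight values below are made up for illustration and the function names are assumptions; the actual procedure operated on the ASCII weight file written by NETS.

```python
def prune(weights, cutoff):
    """Zero every weight whose magnitude is below the cut-off (no connection)."""
    return [[w if abs(w) >= cutoff else 0.0 for w in row] for row in weights]

def count_connections(weights):
    """Count the weights that remain non-zero after pruning."""
    return sum(1 for row in weights for w in row if w != 0.0)

weights = [[0.9, -0.2, 0.56], [-1.7, 0.3, -0.54]]   # made-up values
for cutoff in (0.3, 0.55, 0.6):
    print(cutoff, count_connections(prune(weights, cutoff)))
```

Sweeping the cut-off in a loop like this is what makes it practical to evaluate several values and keep the largest one for which the network still performs satisfactorily.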


0111000b
Character w/ASCII code

(.1 .9 .9 .9 .1
 .9 .1 .1 .1 .9
 .1 .9 .9 .9 .1
 .9 .1 .1 .1 .9
 .9 .1 .1 .1 .9
 .1 .9 .9 .9 .1
 .1 .9 .9 .9 .1 .1 .1)

Input/output vector

Example of training set element.

Figure 2.


CUT-OFF   NUMBER OF     SATISFACTORY   THRESHOLD(1)
VALUE     CONNECTIONS   PERFORMANCE
0.3       688           yes            0.5
0.4       530           yes            0.5
0.5       413           yes            0.5
0.55      344           yes            0.5
0.6       288           no

(1) Any value > threshold is a "one", otherwise "zero".

Table 1.


The actual cut-off values tested ranged up to 1; however, all results with a cut-off above 0.55
were inconsistent with the desired results. Figure 3 is the test vector for the character shown
in figure 2 with its associated output vector. (The cut-off is 0.55 and the threshold is 0.5.)

-- test set for min.net

(.1 .9 .9 .9 .1    -- "8"
 .9 .1 .1 .1 .9
 .1 .9 .9 .9 .1
 .9 .1 .1 .1 .9
 .9 .1 .1 .1 .9
 .1 .9 .9 .9 .1)

Outputs for Input 8:

(0.002 0.846 0.994 0.865 0.036 0.257 0.164)

Output vector for test vector from figure 2.

Figure 3.

With the threshold value taken into consideration the output is 0111000, which is the ASCII
code for "8".
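
The thresholding of the output vector can be checked with a short sketch. The function name is an assumption; the output values and the 0.5 threshold are the ones quoted in Figure 3.

```python
def outputs_to_char(outputs, threshold=0.5):
    """Values above the threshold become '1', all others '0'; decode as ASCII."""
    bits = "".join("1" if v > threshold else "0" for v in outputs)
    return bits, chr(int(bits, 2))

bits, char = outputs_to_char([0.002, 0.846, 0.994, 0.865, 0.036, 0.257, 0.164])
# bits == "0111000", char == "8"
```

The seven output nodes are read most significant bit first, so the pattern 0111000 is decimal 56, the 7-bit ASCII code of the character "8".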


3 Logic Synthesis


The  intent  of  the  neural  network  synthesis  process  is  to  provide  a fully  automatic  path  to 
silicon  realization  once  a network  model  has  been  constructed  and  verified  in  the  NETS 
environment.  The  entire  synthesis  process  is  schematically  shown  in  figure  4.  The  OCT  tool 
set  from  the  University  of  California,  Berkeley  [8],  was  chosen  for  the  back  end  of  this 
procedure, which includes logic optimization, technology mapping, standard-cell place-and-route,
and composite artwork assembly and verification.

Application specific neural network synthesis process.

Figure 4.


First, the completed neural network topology is translated from the NETS environment to
the OCT hardware description language BDS by a NETS-to-OCT program written for this
purpose. A simple example of a single neuron in NETS netlist and the corresponding BDS
description are shown in figure 5. The BDS file is then compiled into unminimized logic
functions by the OCT tool Bdsyn. These are mapped into a standard cell library by MisII.
Currently, the SCMOS2.2 standard-cell library from Mississippi State University is used,
implemented in the SCMOS8 N-Well CMOS process available from the National Science
Foundation MOSIS program. This process has a minimum feature size of two microns.


LAYER  : 0-INPUT  LAYER 
NODES : 5 
TARGET : 1 


LAYER  : 1-OUTPUT  LAYER 
NODES : 1 


Majority  logic  NETS  description. 


MODEL  dumb 

out<0>, sum0<4:0> = in0<4:0>;

ROUTINE  dumbnet; 

1 target  layer  t 6 node  # 0 

sum0<4:0>  = 8 

+ in0<0>

+ in0<1>

+ in0<2> 

+ in0<3> 

- in0<4> 

IF  sum0<4>  EQL  1 

THEN  out<0>  = 1 
ELSE  out<0>  = 0; 

ENDROUTINE; 

ENDMODEL; 

Majority  logic  BDS  description. 

Figure  5. 

MisII  is  an  n-level  logic  optimizer,  which  creates  a realization  of  a logic  function  from  a given 
cell  library  minimizing  both  worst-case  propagation  delay  and  the  number  of  cells  required. 
The  relative  priority  of  area  versus  speed  is  user  selectable.  The  result  is  stored  in  the  OCT 
database and may be verified with MUSA, a multilevel simulator in the OCT suite. From
here,  the  design  process  may  be  easily  iterated  from  the  NETS  description  forward  as  shown 
in  figure  4. 

A number  of  additional  OCT  tools  are  available  for  padring  composition,  composite 
placement  and  channel  routing,  power  distribution  routing,  and  artwork  verification. 
Artwork  may  be  generated  from  the  OCT  database  in  Caltech  Intermediate  Format  (CIF) 
for  release  to  MOSIS  or  other  foundry  services. 

The  standard-cell  realization  of  the  digit  recognizer  described  previously  is  shown  in 
figure  6.  Its  37  neurons  required  2741  standard  cells  in  47  square  millimeters. 


Standard  cell  realization  of  character  recognizer. 

Figure  6. 

4 Conclusions and Future Directions

Figure  7 shows  a block  diagram  of  a 5 input  programmable  neuron.  To  build  the  digit 
recognizer  using  this  generic  neuron  would  require  about  45  neurons.  The  actual  network 
has  37  neurons.  The  increased  number  of  generic  neurons  is  due  to  the  five  input  limitation. 
Many  of  the  nodes  in  the  network  have  more  than  five  inputs.  With  the  generic  neurons, 
multiple  neural  cells  would  be  connected  at  the  outputs  giving  the  behavior  characteristics 
of  a neuron  having  a larger  number  of  inputs.  The  number  of  standard  cells  required  for  the 
entire  network  realized  with  the  generic  5 input  neuron  is  approximately  8280  (~45  neurons 
by  184  standard  cells  per  neuron  [7]).  This  network  would  cover  nearly  141  square 
millimeters. 


[Block diagram labels: synaptic weight flip-flops, threshold flip-flops, output flip-flop]

Generic five input neural cell.

Figure 7.


As stated previously, the network created with the methodology described here requires 2741
standard cells and 47 square millimeters. This represents a 66% reduction in the number
of cells used and silicon area. This reduction will allow the chip to be fabricated at a
significantly lower cost than a chip with a sufficient number of the generic neurons.
Furthermore, all of the silicon area in the application specific network is utilized, whereas a
significant percentage is unused in the general model. These results are very preliminary.
Experiments with simpler models suggest that substantial improvements in standard cell
optimization remain possible.
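
The comparison can be reproduced from the figures quoted in this section. The sketch below only restates that arithmetic; the variable names are illustrative.

```python
generic_cells = 45 * 184             # about 45 generic neurons x 184 cells each
synth_cells, synth_area = 2741, 47   # synthesized network (cells, mm^2)
generic_area = 141                   # mm^2, estimated for the generic version

cell_reduction = 1 - synth_cells / generic_cells
area_reduction = 1 - synth_area / generic_area
print(f"{generic_cells} cells, {cell_reduction:.1%} fewer cells, "
      f"{area_reduction:.1%} less area")
```

Both ratios come out at roughly two thirds, consistent with the reduction quoted in the text.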

The  models  used  in  this  research  were  trained  using  ideal  training  sets,  meaning  that 
the  characters  were  well  formed  and  the  level  of  contrast  between  the  background  and  the 
characters  was  high.  For  the  neural  network  to  have  any  real  value,  a larger  training  set 
would  be  necessary.  This  set  would  have  both  poorly  formed  and  low  contrast  examples  of 
each  character.  Using  a training  set  of  this  type  would  cause  an  increase  in  the  number  of 
connections necessary in the network [5].

The synthesis process described may be used to deliver an application specific neural
network,  trained  to  perform  a specific  task  at  less  cost  than  utilizing  general  neural 
hardware.  Silicon  area  will  be  more  highly  utilized  in  the  application  specific  case  since  only 
the  necessary  circuitry  is  fabricated.  Although  more  research  is  necessary,  early  results  show 
the  method  to  be  promising. 

Abstract- This paper describes experimental results obtained with the use of
measurement reduction for statistical IC fault diagnosis. The reduction method
used involves data pre-processing in a fashion consistent with a specific definition
of parametric faults. The effects of this preprocessing are examined.


1 Introduction 

An integrated circuit test is specified by a combination of input and output signals which
characterizes some attribute of ideal circuit function. The presence of faults in a fabricated
circuit will cause observed output signals to deviate from the simulated ideal. A
fault diagnostic is a decision rule combining what is known about an ideal circuit test
response with information about how the response is distorted by fabrication variations
and measurement noise. The rule is used to detect fault existence in fabricated circuits
using real test equipment.

The IC failure diagnosis problem can be viewed as a statistical pattern recognition
problem. Instead of extracting output response parameters explicitly and comparing them with
the specification, the output responses can be classified as faulty or non-faulty according
to a classification decision rule. It has been demonstrated that pattern
classification techniques can be used in IC diagnosis [Mea90].

Recent experiments [Mea91] have shown that a feedforward network (FFN) classifier
generally performs as well as or better than either the traditional parametric statistical
classifier, the Gaussian Maximum Likelihood (GML) classifier, or the non-parametric
K-Nearest Neighbors (KNN) classifier. However, an FFN usually needs more computational
effort in the training phase to establish the discriminant function. To be more
effective, there is a need to find ways to consistently reduce this training overhead while
retaining prediction accuracy.

The performance of a classifier depends on the data presented in training,
the discriminant function established in the training phase, and the classification
algorithm of the classifier. To ensure high accuracy, the training data must contain the
information essential for establishing the discriminant function of
the classifier.


1 This work was supported by NSF-UIC CDADIC Project 90-1.

In IC diagnosis, a circuit fault is determined by classifying the circuit from its
input-output measurements according to a decision rule built upon the estimated
prior probability distribution in the performance space of the circuit. Judgment is made
according to the decision rule established in training, which defines the
decision boundaries for the classification. For accurate classification, the decision boundaries
of the classifier have to coincide with, or lie close to, the performance specification
boundaries. The classifier has to capture these boundaries in training in order to set up its
discriminant function or decision rules. The acceptance region of the fabricated
circuits lies between the upper and lower performance specification limits. For highest
accuracy, the decision boundaries have to be built in the training phase around the
specification transition space, that is, around the upper and lower specification limits
rather than the mean of the performance distribution.

It is well known that in the back-propagation training algorithm, input values are
multiplied by the derivative of the logistic function, so that a window is placed on the
current estimated decision boundaries, not the mean [Lip88]. This characteristic is very
important in the IC diagnostic problem, especially in go/no-go testing. If the discriminant function
is established around the specification boundaries, the performance of the
classifier will improve. In addition, training on these boundaries will reduce the training
computational load.
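
The windowing effect of the logistic derivative can be illustrated numerically. The following Python sketch is not taken from the experiments reported here; it assumes only the standard logistic activation, and the function names are illustrative.

```python
import math

def logistic(a):
    """Standard logistic activation."""
    return 1.0 / (1.0 + math.exp(-a))

def logistic_derivative(a):
    """Derivative of the logistic function, f(a) * (1 - f(a))."""
    f = logistic(a)
    return f * (1.0 - f)

# A net input of 0 corresponds to the unit's current decision boundary.
for a in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"a = {a:+.1f}  derivative = {logistic_derivative(a):.4f}")
```

The derivative peaks at 0.25 on the boundary and falls below 0.01 six units away from it, so patterns far from the current boundary contribute little to each weight update.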

To reduce the training effort, it is therefore logical to train an FFN on data from the
decision boundaries. If the data used in training are collected around these boundaries,
the discriminant function computed by the trained network will be more accurate in these
regions. The training computational load will fall as well, since fewer epochs are
required to converge to a given accuracy. This paper reports on experiments conducted to
help verify this idea.

2 Data Reduction Method: Boundary Band Data Preprocessing

In contrast to the design task, the concern of IC fault diagnosis is mainly whether the
circuit performance falls within the acceptance region, rather than the performance mean. In
other words, the specification transition boundaries are the main concern in IC diagnosis.
If the decision rules or decision boundaries of a diagnosis algorithm are based on these
transition boundaries, it is reasonable to expect a highly accurate and effective diagnostic
capability.

As discussed in the preceding section, there is a need to reduce the computational load
of training an FFN classifier, even though it has better diagnostic capability than the
traditional statistical classifiers. Here, we propose a Boundary Band Data (BBD) training
method to reduce the computational load in the training phase. The
proposed method relies on a characteristic of the FFN. In the back-propagation
training algorithm, input values are multiplied by the derivative of the logistic
function, so that a window is placed on the current estimated decision boundaries, not
the mean. If the discriminant function or decision boundary of the trained network is set
up around the specification transition boundaries, the performance of the
classifier will improve, and training on these boundaries will also reduce the training
computational load. Making use of this characteristic, the proposed Boundary Band
Data training method trains an FFN with data gathered from the proximity of the
performance specification transition boundaries.

Figure 1: Operational amplifier circuit diagram

3 Experiment  Set  Up 

To investigate the feasibility of using BBD in FFN training, experiments were conducted
on the transient response and frequency response of the operational amplifier
shown in Figure 1. For the frequency response experiments, the
open-loop frequency response of the operational amplifier was used. For the transient response
experiments, the step response of an inverting amplifier built with the same operational
amplifier was used. The circuit configuration for the transient
response experiments is shown in Figure 2.

4 Fault  Definition 

All experiments were designed to detect parametric faults in an operational amplifier.
Monte Carlo simulation of MOSFET model parameters was used. Only statistically
independent model parameters were used, so that correlation effects among model
parameters were eliminated. In each experiment, a circuit fault was defined as a large
variation in one of the independent model parameters. Three types of
parametric faults were used: variations in MOSFET oxide thickness (of all the
transistors in the circuit), zero-bias threshold voltage, and junction depth.

Figure 2: Inverting amplifier circuit configuration

Monte Carlo simulation using SPICE, with a large variation around the mean value
of the chosen model parameter, was used. With pre-selected performance criteria, the
upper and lower limits of the model parameter that define the transition boundaries
between faulty and normal circuits could be determined. For example, consider
defining the fault and normal boundaries for the experiment with the inverting amplifier
circuit. The fault was chosen to be a variation in oxide thickness, so the SPICE model
parameter tox was chosen, with a mean value of 600 Å. The related circuit performance
criteria were the step response overshoot and the slope of the step response. Using the
above method, the transition boundaries were set at 400 Å and 800 Å, and the acceptance region
was set between these limits. Any circuit that fell within this region was defined to be normal;
otherwise, it was defined as faulty.
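
The acceptance rule for the oxide thickness example can be written as a short Python sketch. It is an illustration only: the function names are hypothetical, and treating the limits themselves as part of the acceptance region is an assumption, since the text does not say how values exactly on a boundary are labeled.

```python
def is_normal(value, lower, upper):
    """Return True when a parameter value lies inside the acceptance region.

    Inclusive limits are an assumption made for this sketch.
    """
    return lower <= value <= upper

def relative_variation(mean, limit):
    """Fractional deviation of a transition boundary from the mean."""
    return abs(limit - mean) / mean

# Oxide thickness values quoted in the text (angstroms).
TOX_MEAN, TOX_LOWER, TOX_UPPER = 600.0, 400.0, 800.0

print(is_normal(650.0, TOX_LOWER, TOX_UPPER))   # inside the limits
print(is_normal(850.0, TOX_LOWER, TOX_UPPER))   # outside the limits
print(round(relative_variation(TOX_MEAN, TOX_UPPER), 3))
```

With these limits the boundaries sit one third of the mean away from it, which corresponds to the 33% oxide thickness variation used in the experiments.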

In  this  study,  three  types  of  parametric  faults  were  studied.  They  were  circuit  faults 
in  oxide  thickness,  junction  depth,  and  zero  biased  threshold  voltage.  For  each  of  the 
experiments,  there  was  only  a single  parametric  fault  existing  in  the  circuit.  The  mean 
model  parameter  values  for  normal  circuit  and  transition  boundaries  for  circuit  faults  are 
listed  in  Table  1. 


5 Experiment  Description 

Eight experiments, divided into two categories, were used to investigate the BBD
training method. The first category consisted of six experiments. To simplify the
problem, only one SPICE model parameter was allowed to vary in each experiment,
under the assumption that no process noise existed in IC fabrication
apart from the process fault. Even though such an assumption might not be realistic for actual IC
fabrication, the goal of these non-noisy experiments was to study the effect of the proposed
BBD training method under ideal conditions. In these non-noisy experiments, the
transient response of the amplifier in a closed-loop inverting circuit under various nominal
and faulty conditions was used to develop the experiment database for experiments 1
to 3. For experiments 4 to 6, the open-loop frequency response of the amplifier under the
same nominal and faulty conditions as in experiments 1 to 3 was used.

Experiment  Parametric Fault   Performance Parameters   Mean     Upper Limit  Lower Limit
1           oxide thickness    slew rate and overshoot  600 Å    800 Å        400 Å
2           junction depth     slew rate                0.4 µm   0.6 µm       0.2 µm
3           threshold voltage  slew rate                |0.7V|   |0.5V|       |0.9V|
4           oxide thickness    slew rate                600 Å    800 Å        400 Å
5           junction depth     slew rate                0.4 µm   0.6 µm       0.2 µm
6           threshold voltage  slew rate                |0.7V|   |0.5V|       |0.9V|
7           oxide thickness    slew rate and overshoot  600 Å    800 Å        400 Å
8           junction depth     slew rate                0.4 µm   0.6 µm       0.2 µm

Table 1: SPICE model parameter mean and transition boundary values

The six non-noisy experiments were:

Exp. 1: Detect a 33% variation in oxide thickness of all the transistors in the circuit by
observing the circuit open-loop frequency response.

Exp. 2: Detect a 50% variation in junction depth of all the transistors in the circuit by
observing the circuit open-loop frequency response.

Exp. 3: Detect a 30% variation in threshold voltage of all the transistors in the circuit by
observing the circuit open-loop frequency response.

Exp. 4: Detect a 33% variation in oxide thickness of all the transistors in the circuit by
observing the circuit time-domain step response.

Exp. 5: Detect a 50% variation in junction depth of all the transistors in the circuit by
observing the circuit time-domain step response.

Exp. 6: Detect a 30% variation in threshold voltage of all the transistors in the circuit by
observing the circuit time-domain step response.

The second category consisted of two experiments, similar to experiments 4 and 5
except that process noise was present. The assumption was that process noise existed
in fabrication but did not contribute to circuit faults, which is more realistic for actual
IC fabrication. The goal of these experiments was to study the effect of BBD training of
an FFN in a non-ideal environment. The process noise was generated by varying the
statistically independent model parameters [She88] of lateral diffusion (LD), substrate
doping density (NSUB), bulk threshold parameter (gamma), and channel-length modulation
(lambda) by at most one percent. In these two experiments, the transient response of the
amplifier in a closed-loop inverting circuit under various nominal and faulty conditions was
used to develop the experiment database. Only one parametric fault was assumed,
accompanied by all the process noise sources listed above. The noisy
experiments were:

Exp. 7: Detect a 33% variation in oxide thickness of all the transistors, with process noise in
the circuit, by observing the circuit time-domain step response.

Exp. 8: Detect a 50% variation in junction depth of all the transistors, with process noise in
the circuit, by observing the circuit time-domain step response.

6 Data  Generation 

In each experiment, two data distributions, normally distributed data and
boundary band data, as shown in Figures 3 and 4, were used to build the experiment
database. For each data distribution, 120 simulated responses were obtained via Monte
Carlo simulation; 60 of the responses correspond to the fault-free condition and 60 to
the faulty condition. Data for the boundary band distribution were generated around the
transition boundaries. The sample data distribution of each experiment is similar
to Figures 3 and 4, differing only in the variation percentage of the corresponding model
parameter. The corresponding circuit performance distributions for the two data
distributions are shown in Figures 5 and 6. In other words, four data sets made up the
experimental database for each experiment: faulty circuits with normal distribution,
normal circuits with normal distribution, faulty circuits with boundary band distribution,
and normal circuits with boundary band distribution. For a particular data distribution,
30 responses from each class (normal/faulty) were used for classifier training. After
training, the classifiers were tested on the unseen data from the distribution used in
training as well as on the data from the other distribution.
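
The construction of the two databases can be sketched at the level of the fault parameter alone. The Python fragment below is an illustration under several assumptions that are not stated in the text: the spread of the normal distribution (`sigma`), the half-width of the band around each boundary, and uniform sampling inside the band are all hypothetical choices, and the SPICE simulation that turns each parameter value into a circuit response is omitted.

```python
import random

def label(value, lower, upper):
    """Class of a parameter value relative to the acceptance region."""
    return "normal" if lower <= value <= upper else "faulty"

def sample_until_balanced(draw, lower, upper, per_class):
    """Draw parameter values until each class holds per_class samples."""
    data = {"normal": [], "faulty": []}
    while min(len(v) for v in data.values()) < per_class:
        x = draw()
        bucket = data[label(x, lower, upper)]
        if len(bucket) < per_class:
            bucket.append(x)
    return data

def normal_draw(rng, mean, sigma):
    """Sampler for the normally distributed database (sigma is assumed)."""
    return lambda: rng.gauss(mean, sigma)

def boundary_band_draw(rng, lower, upper, half_width):
    """Sampler for the boundary band database (band shape is assumed)."""
    def draw():
        edge = rng.choice((lower, upper))
        return rng.uniform(edge - half_width, edge + half_width)
    return draw

rng = random.Random(0)
MEAN, LOWER, UPPER = 600.0, 400.0, 800.0   # oxide thickness example, angstroms

normal_db = sample_until_balanced(normal_draw(rng, MEAN, 200.0), LOWER, UPPER, 60)
band_db = sample_until_balanced(boundary_band_draw(rng, LOWER, UPPER, 50.0), LOWER, UPPER, 60)

# 30 patterns per class for training, the remaining 30 per class held out.
train = {c: v[:30] for c, v in band_db.items()}
unseen = {c: v[30:] for c, v in band_db.items()}
print(len(train["normal"]), len(unseen["faulty"]))
```

Every boundary band sample lies within the chosen half-width of a transition boundary, so the training patterns concentrate where the classifier's decision has to be made.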

7 Classifier  Training 

As mentioned in the introduction, the objective of this study is to contrast the effectiveness
of a feedforward network classifier trained on boundary band data against that of
traditional statistical classifiers and of a feedforward network trained on normally
distributed data, in the context of IC fault diagnosis. The classifiers used in this study were
the Gaussian Maximum Likelihood classifier, the K-Nearest Neighbor classifier, and the
feedforward network classifier.

Thirty patterns were chosen from each of the normal and faulty classes of each experimental
database for training. For GML, the training data were used to build the corresponding
mean matrix, covariance matrix, and inverse covariance matrix. For KNN, the training
data were used as the reference set for the classifier. For the FFN, different types of training
were used to establish the discriminant function: FFNs were trained
with non-noisy normally distributed data, non-noisy boundary band data, noisy normally
distributed data, and noisy boundary band data. Only one of the listed training methods
was used for each trained FFN. Unlike traditional statistical classifiers, an FFN allows a
choice of training criteria. We trained our FFNs either to a criterion on the total sum of
squared error over all the training data for a particular type of training, or up to a preset
training epoch limit.

Figure 3: Data with normal distribution
Figure 4: Data with boundary band distribution
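
The two stopping criteria, a target on the total sum of squared error (tss) and a preset epoch limit, can be sketched in Python. The fragment below is a deliberate simplification and not the network used in this study: it trains a single logistic unit on one-dimensional toy data, and the learning rate, tss target, and data values are hypothetical.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(patterns, targets, tss_goal, max_epochs, rate=0.5):
    """Batch gradient descent on one logistic unit.

    Stops when the total sum of squared error reaches tss_goal or when
    max_epochs weight updates have been made. Returns (w, b, epochs, tss),
    where tss is the error measured at the last evaluation.
    """
    w, b = 0.0, 0.0
    tss = float("inf")
    epochs = 0
    while epochs < max_epochs:
        tss, grad_w, grad_b = 0.0, 0.0, 0.0
        for x, t in zip(patterns, targets):
            y = logistic(w * x + b)
            err = t - y
            tss += err * err
            delta = err * y * (1.0 - y)   # error times logistic derivative
            grad_w += delta * x
            grad_b += delta
        if tss <= tss_goal:
            break
        w += rate * grad_w
        b += rate * grad_b
        epochs += 1
    return w, b, epochs, tss

# Toy, linearly separable data (hypothetical values).
xs = [-1.0, -0.5, 0.5, 1.0]
ts = [0.0, 0.0, 1.0, 1.0]

w, b, epochs, tss = train(xs, ts, tss_goal=0.05, max_epochs=10000)
print(epochs, round(tss, 4))
```

Whichever criterion is met first ends the training, so the epoch limit bounds the computational load even when the tss target is not reached.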


8 Classifier  Computational  Load  Calculation 

The performance of each classifier was measured not only in terms of predictive accuracy on
previously unseen data, but also by the number of floating point operations (FLOPs) required
to construct the classifier and the number of FLOPs required to perform a diagnostic
classification. The number of FLOPs computed for each classifier in each experiment
is based on the implementation algorithm; it is not the actual number of computer operations.
Since different software packages were used to implement the classifiers, a comparison
based on real CPU time would not be accurate. For comparing the computational requirement
in testing, the number of FLOPs required per pattern is calculated as 2y(n + 1) + 2z(y + 1)
for the feedforward network, mn(3 + 2n) for the Gaussian Maximum Likelihood
classifier, and 3mnp for the K-Nearest Neighbor classifier (m: number of classes, n: number
of measurements for each pattern, y: number of hidden units, z: number of output units,
p: number of patterns in the training set for each class).
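
The per-pattern operation counts can be evaluated directly. The Python sketch below implements the three expressions, reading the K-Nearest Neighbor count as 3mnp; the network and data dimensions in the example are hypothetical, since the sizes used in the experiments are not given in this section.

```python
def ffn_flops(n, y, z):
    """2y(n + 1) + 2z(y + 1): n inputs, y hidden units, z output units."""
    return 2 * y * (n + 1) + 2 * z * (y + 1)

def gml_flops(m, n):
    """mn(3 + 2n): m classes, n measurements per pattern."""
    return m * n * (3 + 2 * n)

def knn_flops(m, n, p):
    """3mnp: m classes, n measurements, p training patterns per class."""
    return 3 * m * n * p

# Hypothetical dimensions chosen only for illustration.
m, n, y, z, p = 2, 40, 10, 1, 30
print(ffn_flops(n, y, z), gml_flops(m, n), knn_flops(m, n, p))
```

The feedforward count grows linearly in n, the Gaussian Maximum Likelihood count quadratically, and the K-Nearest Neighbor count with the size of the stored training set, which is why the feedforward network is the cheapest to evaluate for long measurement vectors.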


9 Experiment  Results 

In each of the experiments, the performance of the classifiers was evaluated in terms of
prediction accuracy on unseen data as well as training and testing computational load.
The results of non-noisy experiments 1 to 6 are summarized in Table 2 and Table 3. The
results of noisy experiments 7 and 8 are summarized in Table 4.

Classifier   Exp. 1  Exp. 2  Exp. 3  Exp. 4  Exp. 5  Exp. 6

Accuracy (unseen data - %)
FFN(Norm)    82.2    88.8    97.7    88      95.5    98.8
FFN(Boun)    99.4    97.2    98.8    97.2    98.8    96.6
FFN(ReBo)    98.8    97.7    100     98.3    99.4    97.2
GML          N.W     N.W     N.W     50      N.W     N.W
1NN          85      96      98.3    50      96      100
3NN          81.6    100     98.3    53.2    100     100
5NN          80      96      98.3    50      96      100

Setup FLOPs (total)
FFN(Norm)    3e8     3e8     3e8     2.7e8   2.7e8   2.7e8
FFN(Boun)    3e8     3e8     3e8     2.7e8   2.7e8   2.7e8
FFN(ReBo)    6e7     6e7     3e7     5.4e7   5.4e7   2.7e8
GML          N.A     N.A     N.A     5e4     N.A     N.A
KNN          none

Diagnostic FLOPs (per pattern)
FFN(Norm)    1.5e3   1.5e3   1.5e3   1.4e3   1.4e3   1.4e3
FFN(Boun)    1.5e3   1.5e3   1.5e3   1.4e3   1.4e3   1.4e3
FFN(ReBo)    1.5e3   1.5e3   1.5e3   1.4e3   1.4e3   1.4e3
GML          N.A     N.A     N.A     5.6e3   N.A     N.A
KNN          1.5e4   1.5e4   1.5e4   1.3e4   1.3e4   1.3e4

N.A: Not applicable
N.W: Not working for the case due to singular covariance matrix
GML: Gaussian Maximum Likelihood Classifier
KNN: K-Nearest Neighbors Classifier (1NN, 3NN, 5NN: K = 1, 3, 5)
FFN(Norm): Feedforward network trained with normally distributed data
FFN(Boun): Feedforward network trained with boundary band data
FFN(ReBo): Feedforward network trained with boundary band data and reduced training epochs

Table 2: Classifier Accuracy and Computational Overhead of Exp. 1 to 6

We trained three FFNs for each of the experiments. FFN(Norm) was trained with normally
distributed data until the total sum of squared error (tss)
no longer changed much with further training; we set this stopping point
at 2e5 epochs. FFN(Boun) was trained with the boundary band data around
the transition values, using the same stopping point as FFN(Norm). The
third, FFN(ReBo), was trained with boundary band
data but with fewer training epochs. Its stopping point depended
on prediction accuracy: we stopped the training once the
prediction accuracy of the network was similar to that of FFN(Boun).

The results of experiments 1 to 6, summarized in Table 2, show that the
feedforward networks performed better than GML and KNN. In general, an
FFN trained with boundary band data also had better prediction accuracy than an FFN
trained with normally distributed data. These results are as predicted in the
proposed method section: the decision boundaries of the trained networks
were expected to be set around the transition boundaries in the performance space. Moreover,
the network trained with fewer epochs had, in general, better prediction accuracy. This is
attributed to the nature of neural networks: less training may keep the network
from memorizing the training data. In most cases, the inverse covariance matrix of the
training data for GML could not be computed without further data preprocessing.

To investigate the effectiveness of boundary band training, different types of training
were used in our experiments. FFNs were trained with non-noisy normally distributed
data, non-noisy boundary band data, noisy normally distributed data, and noisy boundary
band data. Only one of the listed training methods was used for each trained FFN. The
results for FFNs trained with these methods in the non-noisy experiments are summarized
in Table 3. Experiments with process noise are summarized in Table 4.

Tables 3 and 4 show the prediction accuracy of FFNs trained with the different
methods. Each trained FFN was tested on unseen data from both the
normally distributed database and the boundary band database, so two prediction
accuracies are reported for each trained FFN: one on unseen
data from the normally distributed database (labeled Normal) and one on unseen data from
the boundary band database (labeled Boundary). The results show that boundary band
training worked in both the non-noisy and noisy cases. In general, an FFN trained with
boundary band data for fewer epochs performed as well as, and in some cases better than,
an FFN trained with normally distributed data. There was also a favorable trade-off
between training epochs and prediction accuracy for FFNs trained with boundary band data:
very little prediction degradation accompanied a significant reduction in the computational
load spent on training.

Frequency Response               Exp. 1  Exp. 2  Exp. 3  Exp. 4  Exp. 5  Exp. 6

Accuracy (unseen data - %) with various distribution

FFN(Norm) trained with normally distributed data
  Normal                         95      96.6    98.3    96.6    96.6    98.3
  Boundary                       75.8    85      97.5    85      95      99.1

FFN(Boun) trained with boundary band data
  Normal                         99.1    98.3    99.1    96.6    98.3    96.6
  Boundary                       95      99.1    98.3    100     (two of six entries missing in source)

FFN(ReBo) trained with boundary band data and reduced training epoch
  Normal                         98.3    97.5    100     98.3    99.1    97.5
  Boundary                       100     98.3    98.3    100     96.3    (one of six entries missing in source)

Setup Computation Load (Training)

FFN(Norm) trained with normally distributed data
  Epoch                          2e5     2e5     2e5     2e5     2e5     2e5
  Flops                          3e8     3e8     3e8     2.7e8   2.7e8   2.7e8
  Load                           100%    100%    100%    100%    100%    100%

FFN(Boun) trained with boundary band data
  Epoch                          2e5     2e5     2e5     2e5     2e5     2e5
  Flops                          3e8     3e8     3e8     2.7e8   2.7e8   2.7e8
  Load                           100%    100%    100%    100%    100%    100%

FFN(ReBo) trained with boundary band data and reduced training epoch
  Epoch                          4e4     4e4     2e4     4e4     2e4     2e4
  Flops                          6e7     6e7     3e7     5.4e7   2.7e7   2.7e7
  Load                           20%     20%     10%     20%     10%     10%

Table 3: Comparison of Feedforward Network with Different Training of Exp. 1 to 6
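
The Load and Flops rows of Table 3 are consistent with a fixed cost per training epoch. The Python sketch below reproduces that arithmetic from the tabulated epoch and FLOP counts; interpreting Load as the epoch count relative to the 2e5-epoch reference training is an assumption, and the function names are illustrative.

```python
REFERENCE_EPOCHS = 2e5   # stopping point used for FFN(Norm) and FFN(Boun)

def load_percent(epochs, reference=REFERENCE_EPOCHS):
    """Training load relative to the reference training, in percent."""
    return 100.0 * epochs / reference

def flops_per_epoch(total_flops, epochs):
    """Average training cost of one epoch."""
    return total_flops / epochs

# Values taken from Table 3, Exp. 1.
print(load_percent(4e4))            # reduced-epoch training
print(flops_per_epoch(3e8, 2e5))    # full training
print(flops_per_epoch(6e7, 4e4))    # reduced-epoch training
```

Both trainings cost the same number of operations per epoch, so the saving reported for the reduced-epoch network comes entirely from stopping earlier.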


Transient Response               Exp. 7  Exp. 8

Accuracy (unseen data - %) with various distribution

FFN(Norm) trained with normally distributed data
  Normal                         98.3    95
  Boundary                       94.1    94.1

FFN(Boun) trained with boundary band data
  Normal                         97.5    99.1
  Boundary                       98.3    96.6

FFN(ReBo) trained with boundary band data and reduced training epoch
  Normal                         99.1    96.6
  Boundary                       98.3    95

Setup Computation Load (Training)

FFN(Norm) trained with normally distributed data
  Epoch                          2e5     2e5
  Flops                          2.7e8   2.7e8
  Load                           100%    100%

FFN(Boun) trained with boundary band data
  Epoch                          2e5     2e5
  Flops                          2.7e8   2.7e8
  Load                           100%    100%

FFN(ReBo) trained with boundary band data and reduced training epoch
  Epoch                          4e3     4e3
  Flops                          5.4e6   5.4e6
  Load                           2%      2%

Table 4: Comparison of Feedforward Network with Different Training of Noisy Exp. 7 and 8


10  Conclusion 

We studied the effectiveness of a feedforward network classifier trained on boundary band data
against that of traditional statistical classifiers and of a feedforward network trained on
normally distributed data, in the context of IC fault diagnosis. Eight experiments, with
and without process noise, were conducted. The experimental results once again
demonstrated that, in general, the feedforward network outperformed the traditional
statistical classifiers, namely the Gaussian Maximum Likelihood classifier and the K-Nearest
Neighbor classifier. Training feedforward networks with boundary band data reduced the training
effort with only a small degradation in prediction accuracy. The results showed that the
proposed boundary band data method did reduce the computational load needed for
feedforward network training.


Fuzzy Control of
Magnetic Bearings

J.  J.  Feeley,  G.  M.  Niederauer,  and  D.  J.  Ahlstrom 
Department  of  Electrical  Engineering 
NASA  Space  Engineering  Research  Center  for  VLSI  Design 
University  of  Idaho 
Moscow,  Idaho  83843 

Abstract - This paper considers the use of an adaptive fuzzy control algorithm
implemented on a VLSI chip for the control of a magnetic bearing. The architecture
of the adaptive fuzzy controller is similar to that of a neural network.
The performance of the fuzzy controller is compared to that of a conventional
controller by computer simulation.


1 Introduction 

Magnetic levitation is receiving increasing attention as a viable alternative to conventional
methods of moving and positioning objects [1]. NASA, for example, has developed a
cryogenic cooler that uses magnetic bearings and actuators exclusively [2]. One of the
more difficult aspects of the application of magnetic bearings is the control of the position
of the shaft in the bearing housing. Considerable attention has been given to this problem
recently. Williams et al. [3] reported on the digital control of active magnetic bearings and
showed how the flexibility of digital control was extremely useful in implementing a number
of control algorithms, including second-derivative and integral feedback. Chen and Darlow
[4] describe an analog control system for an active magnetic bearing that uses velocity and
acceleration observers to improve damping and to cancel imbalance and other disturbance
forces, greatly improving overall system performance. Keith et al. [5] discuss the
magnetic support of a flexible shaft at speeds up to 14,000 RPM using a PC-based digital
controller implementing a proportional-derivative control algorithm. A comparison with an
earlier analog proportional-derivative controller is also made. Chen [6] describes an active
magnetic bearing control scheme using three parallel feedback loops to achieve dynamic
stiffness, static stiffness, and damping. He presents a closed-form solution for controller
parameters in terms of desired stiffness and damping. Humphris et al. [7] present a
comprehensive treatment of the active magnetic bearing control problem and compare the
relative performance of low bandwidth and high bandwidth controllers. Scudiere et al. [8]
used a Texas Instruments TMS32010 digital signal processor to implement a
proportional-integral-derivative control algorithm to successfully control the position of a
number of small spheres and rotors. Feeley et al. [9] described the root locus design of a
double lead-lag controller mapped into an equivalent digital controller via the Tustin
transformation. The resulting algorithm has been implemented on an Intel 80KC196C
microprocessor and used to control an analog computer model of the NASA magnetic bearing.

The difficulty of the control problem stems from two basic causes. The first is
the physical nature of the magnetic bearing system itself. As shown in Section 2, the
uncontrolled magnetic bearing system is unstable, uncertain, and highly nonlinear. The
instability is due to the relentlessness of gravity in causing any suspended object to fall.
The uncertainties arise from the difficulties in modeling viscous friction, eddy currents,
and leakage flux, and in accounting for disturbance forces due to vehicle acceleration, motion
of the shaft, and other random events. The nonlinearities arise from the square-law nature of
magnetic forces, the nonlinear relationship between actuator current and magnetic flux,
and the nonlinear properties of materials in the magnetic circuit. The second basic cause of
difficulty stems from the decision to use digital control. Sampling is
inherent in digital control, and it is reasonable to expect poorer performance from a digital
control system using data samples than from its ideal analog equivalent using continuous
data. This inevitable degradation in performance encountered in moving from analog to
digital control must be compensated for by the use of more sophisticated digital control
algorithms and the other advantages inherent in digital control.

A control scheme that is effective in overcoming these two basic causes of difficulty
in  the  control  problem  is  presented  in  this  paper.  The  scheme  is  based  on  the  theory  of 
fuzzy  systems.  The  modeling  problem  is  addressed  by  substituting  the  imprecise  linguistic 
model  of  fuzzy  theory  for  the  precise  model  of  physical  theory.  The  sampling  problem  is 
addressed  by  implementing  the  fuzzy  algorithm  in  a parallel  architecture  suitable  for  VLSI 
implementation  thereby  reducing  processing  time  and  allowing  high  sampling  rates. 

The  remainder  of  the  paper  is  organized  as  follows.  Section  2 describes  the  magnetic 
bearing system and presents a mathematical model developed by Feeley et al. [10]. In
Section  3 some  essential  elements  of  fuzzy  control  theory  are  presented  and  an  adaptive 
fuzzy  controller  is  developed.  In  Section  4 the  performance  of  the  fuzzy  controller  is 
analyzed using a computer simulation based on the nonlinear model of Section 2. An
adaptive fuzzy control VLSI chip architecture is outlined in Section 5 and some conclusions
and recommendations are given in Section 6.


2 Magnetic  Bearing  System 

A schematic  cross-sectional  side  view  of  NASA’s  magnetic  bearing  is  shown  in  Figure  1 
supporting  one  end  of  a rigid  shaft.  An  end  view  would  show  the  circular  cross-section 
shaft centered in the annular gap created by the bearing housing and the shaft. Figure
1 also  shows  the  shaft  magnetic  material  inlays  that  provide  paths  for  the  magnetic  flux 
produced  by  the  adjacent  bearing  actuators.  The  actuators  are  symmetrically  located  in 
the  bearing  housing  and  consist  of  magnetic  material  pole  pieces  and  coils  of  copper  wire. 
A position  sensor  is  located  close  to  each  actuator  to  measure  the  position  of  the  shaft.  A 
total  of  four  actuator  and  position  sensor  assemblies  are  located  at  90°  increments  around 
the  circumference  of  the  housing.  Coordinated  control  of  opposing  actuators  permits  posi- 
tioning of  the  end  of  the  shaft  anywhere  in  the  annular  gap.  An  identical  bearing  assembly 
supports  the  other  end  of  the  shaft.  For  simplicity,  rotational  forces  are  not  directly  ac- 
counted for  and  half  of  the  shaft  mass  is  assumed  to  be  concentrated  at  the  point  of  action 
of the magnetic forces of each bearing assembly.


Figure  1:  Schematic  cross-section  of  magnetic  bearing  assembly 

Assuming motion in the one-dimensional coordinate system defined in Figure 1,
application of Newton's second law yields

$$\frac{d^2 y}{dt^2} = \frac{F_1 - F_2 - F_d - F_f}{M}$$

where $y$ is the position of the shaft, $F_1$ is the magnetic force exerted by the upper
actuator, $F_2$ is the magnetic force exerted by the lower actuator, $F_d$ is a disturbance force,
and $F_f$ is a viscous friction force. $F_1$ and $F_2$ are, in turn, defined by

$$F_1 = \frac{\mu_0 A}{4} \left[ \frac{N_c i_1}{y_0 - y} \right]^2$$

and

$$F_2 = \frac{\mu_0 A}{4} \left[ \frac{N_c i_2}{y_0 + y} \right]^2$$

where $\mu_0$ is the magnetic permeability, $A$ is the area of one pole face, $N_c$ is the number of
coil turns, $i_1$ and $i_2$ are the coil currents, and $y_0$ is the initial gap distance. The friction force
is assumed proportional to the square of the shaft velocity $v$ and is modeled mathematically
as $F_f = K_f v |v|$, and the disturbance force is taken as an exogenous input.
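
As a rough numerical illustration of this force balance, the sketch below evaluates the net shaft acceleration for given coil currents. It assumes the lower-gap length is y0 + y (matching the gap definitions used for the flux expressions) and uses made-up parameter values; the function names, parameter names, and the numbers in EXAMPLE are illustrative and do not come from the paper.

```python
import math

MU0 = 4e-7 * math.pi  # permeability of free space, H/m


def actuator_force(current, gap, area, n_turns):
    """Magnetic force (mu0 * A / 4) * (Nc * i / gap)^2 of one actuator."""
    return MU0 * area / 4.0 * (n_turns * current / gap) ** 2


def shaft_acceleration(y, v, i1, i2, f_dist, *, mass, area, n_turns, y0, k_f):
    """Net acceleration d2y/dt2 = (F1 - F2 - Fd - Ff) / M."""
    f1 = actuator_force(i1, y0 - y, area, n_turns)  # upper actuator
    f2 = actuator_force(i2, y0 + y, area, n_turns)  # lower actuator (assumed gap y0 + y)
    f_f = k_f * v * abs(v)                          # friction, square-law in velocity
    return (f1 - f2 - f_dist - f_f) / mass


# Hypothetical parameter values, chosen only to exercise the functions.
EXAMPLE = dict(mass=1.0, area=1e-4, n_turns=100, y0=0.5e-3, k_f=0.0)
```

With equal currents and a centered shaft the two forces cancel; any small upward displacement makes the upper force dominate, which is the open-loop instability the controller has to counteract.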

The electromagnetics of the actuator are modeled with the aid of the circuit diagram
of Figure 2. The circuit model consists of two loops, one for the primary coil current $i_c$,
and a second for the induced eddy current $i_e$. Applying Kirchhoff's voltage law to each loop
yields the circuit equations

$$v_c = R i_c + N_c \frac{d\phi_c}{dt} + N_{ce} \frac{d\phi_e}{dt}$$

$$0 = R_e i_e + N_e \frac{d\phi_e}{dt} + N_{ec} \frac{d\phi_c}{dt}$$

where $v_c$ is the voltage applied to the coil, $N_c$ is the number of turns in the coil, $\phi_c$
is the flux produced by the coil current, $N_{ce}$ is the number of turns of the coil linked by
the flux produced by the eddy currents, $\phi_e$ is the flux produced by the eddy currents, $R_e$
is the resistance of the eddy current paths, $N_e$ is the number of turns in the equivalent
eddy current coil, and $N_{ec}$ is the number of turns in the equivalent eddy current coil linked
by the flux produced by the primary current. Assuming the entire mmf drop of the magnetic
circuit is taken across the two air gaps, the fluxes can be expressed in terms of the currents
as $\phi_c = \mu_0 A N_c i_c / (2 y_{ag})$ and $\phi_e = \mu_0 A N_e i_e / (2 y_{ag})$, where $y_{ag}$ is the distance
between the pole piece and the shaft, $y_0 - y$ for the upper gap and $y_0 + y$ for the lower gap.
Solving these equations for the time derivatives of the currents leads to


[Equations for $di_c/dt$ and $di_e/dt$: not legible in the source.]

where $a = N_c N_e - N_{ce} N_{ec}$; the definitions of the remaining auxiliary constants
(including $k$) are likewise not legible. The
equations presented in this section constitute a consistent mathematical model relating
the input voltages applied to the actuator coils, $v_{c1}$ and $v_{c2}$, to the position of the shaft,
$y$.
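
For reference, the elimination that yields the current derivatives can be sketched as follows. This is a reconstruction under the assumption that the fluxes are $\phi_c = \mu_0 A N_c i_c / (2 y_{ag})$ and $\phi_e = \mu_0 A N_e i_e / (2 y_{ag})$; the shorthand $g = y_{ag}$ and $c = \mu_0 A / 2$ is introduced here and is not the authors' notation, so the result may be arranged differently from the form printed in the paper.

$$\frac{d}{dt}\left(\frac{i}{g}\right) = \frac{1}{g}\frac{di}{dt} - \frac{i}{g^2}\frac{dg}{dt}$$

Substituting into the two loop equations gives a linear system in the derivatives whose coefficient matrix has determinant $N_c N_e (N_c N_e - N_{ce} N_{ec}) = N_c N_e\, a$. Solving it:

$$\frac{di_c}{dt} = \frac{1}{g}\frac{dg}{dt}\, i_c + \frac{g}{c\, N_c\, a}\left[ N_e \left( v_c - R i_c \right) + N_{ce} R_e i_e \right]$$

$$\frac{di_e}{dt} = \frac{1}{g}\frac{dg}{dt}\, i_e - \frac{g}{c\, N_e\, a}\left[ N_{ec} \left( v_c - R i_c \right) + N_c R_e i_e \right]$$

Here $dg/dt = -dy/dt$ for the upper gap and $+dy/dt$ for the lower gap, which is how the shaft motion couples back into the coil currents.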


3 Fuzzy  Control 

Conventional feedback control systems measure, relatively precisely, certain process
variables, operate on these measurements with a control algorithm to produce precise command
signals, and apply these command signals to the process to control its behavior in some
desired way. The control algorithm generally relies on an explicit mathematical model
of the system to be controlled and some expression of desired system performance. A
crucial element in control algorithm design is the development of a suitable mathematical
model of the system; in general, performance of the controlled system will be no better
than the system model on which the control algorithm is based. The model should be
neither too complicated, making the control algorithm too complex to implement, nor too
simple, missing essential features of system behavior. Since most systems requiring
automatic feedback control are dynamic and nonlinear, the development of a simple model
that still captures the essence of important system performance characteristics is usually
a time-consuming, and in some cases impossible, task.

Figure 2: Circuit model of actuator.

It  is  interesting  to  compare  these  automatic  control  systems  with  manual  control  sys- 
tems where  a human  operator  makes  seemingly  imprecise  measurements,  processes  them 
rapidly  in  the  brain,  and  produces  the  correct  control  command  to,  say,  ride  a bicycle. 
While  it  may  not  be  impossible  to  build  an  automatic  control  system  to  control  a bicycle 
(although  we  have  never  seen  one),  it  would  certainly  be  quite  difficult.  Yet,  a young  child 
can  become  a proficient  rider  after  only  a short  training  session  with  no  knowledge  whatso- 
ever of  the  mathematics  of  bicycle  dynamics.  It  is  this  paradox  that  led  Zadeh  [11]  to  the 
development of the theory of fuzzy sets, Mamdani [12] to consider the linguistic synthesis
of  fuzzy  control  systems,  and,  most  recently,  Kosko  [13]  to  explore  its  connections  with 
neural  networks  in  the  adaptive  control  of  dynamic  systems. 

As  with  neural  network  controllers,  fuzzy  controllers  try  to  emulate  the  functions  of 
the  human  brain.  A fundamental  difference  between  the  two  is  that  neural  controllers 
assume no a priori knowledge of system behavior, while fuzzy controllers start with a
linguistic  description  of  whatever  is  known  about  the  system.  There  is,  however,  a striking 
similarity  at  the  implementation  level  between  neural  network  controllers  and  adaptive 
fuzzy  controllers  [13]. 

3.1 Fuzzy Variables and Fuzzy Values

The  notions  of  fuzzy  control  are  rooted  in  the  theory  of  fuzzy  sets  [11].  The  basic  difference 
between  conventional  (crisp)  set  theory  and  fuzzy  set  theory  lies  in  the  values  assigned  to 
the variables. Consider, for example, a variable called position, y. In crisp theory y could
take on values, say, from 0 m to +10 m. At any particular point in time, the position of an
object could be given by the value, say, 4 m. In fuzzy theory, however, the values assigned
to the position variable, y, are not of the familiar, crisp, numerical type but, rather, of an
unfamiliar, fuzzy, linguistic type; e.g. “close”, or “far”, or “very far”. This is consistent
with the child bicyclist's assessment of position relative to an upcoming tree. Since one
of the strengths of fuzzy theory is that it is basically quantitative in nature, it remains
to relate the fuzzy values “close”, etc. to appropriate numerical values in a fuzzy way
consistent with our notion of the meanings of the corresponding linguistic values. In the
example considered above, “close”, “far”, and “very far” may be characterized by the
distributions shown in Figure 3 where the abscissa is the distance from the tree and the
ordinate is the degree to which “close”, etc. is an accurate representation of the distance
to the tree. Certainly, if the cyclist is about to hit the tree it is “close”, while if it is 10 m
away it is not. If, however, it is 4 m away it is only “close” to a degree; more specifically
“close” is an accurate description of the distance 4 m with degree 0.21, while “far” is an
accurate description of this same distance with degree 0.64, and “very far” is not at all
accurate and, so, is descriptive with degree 0.0. This subjective assessment of “closeness”,
etc. is introduced by the designer in the development of these distributions, or as they are
known in fuzzy theory, “membership functions”. To summarize, it is correct to think of the
fuzzy values “close”, etc. as “fuzzy numbers” whose relationship to “crisp numbers”
is provided by a defining membership function.
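
The mapping from a crisp distance to degrees of membership can be sketched in a few lines. The trapezoid breakpoints below are hypothetical and were chosen only for illustration, so the degrees they produce at 4 m differ from the 0.21 and 0.64 read off the authors' curves; the names trapezoid, POSITION_SETS, and fuzzify are likewise illustrative.

```python
def trapezoid(x, a, b, c, d):
    """Membership degree for a set rising on [a, b], flat on [b, c], falling on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical breakpoints in metres (not taken from the paper's figure).
POSITION_SETS = {
    "close": (-1.0, 0.0, 2.0, 5.0),
    "far": (2.0, 5.0, 6.0, 9.0),
    "very far": (6.0, 9.0, 10.0, 11.0),
}


def fuzzify(x):
    """Return the degree to which each fuzzy value describes the crisp distance x."""
    return {name: trapezoid(x, *abcd) for name, abcd in POSITION_SETS.items()}
```

With these breakpoints a distance of 4 m is partly “close” and partly “far” but not at all “very far”, which is the qualitative behavior described above.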

Figure 3: Membership functions for the fuzzy values “Close”, “Far”, and “Very Far” of
the fuzzy variable “Position”.

3.2 Fuzzy Functions

Analogous  to  the  function  of  crisp  mathematics  that  maps  crisp  input  variables  into  crisp 
output  variables,  fuzzy  mathematics  uses  a relational  matrix  to  map  fuzzy  input  variables 
into  fuzzy  output  variables.  The  relational  matrix  is  constructed  from  a linguistic  rule 
base  relating  fuzzy  input  variables  to  fuzzy  output  variables.  The  linguistic  rule  base 
may  be  generated  from  a set  of  logical  implications  of  the  “IF-THEN”  type.  Consider, 
for example, a system with two fuzzy input position variables x and y, and one fuzzy output
steering variable $\theta$. Let the possible fuzzy values of x be “left” (L), “center” (C), and “right”
(R), let the possible fuzzy values of y be “close” (C), “far” (F), and “very far” (VF), and
let the possible fuzzy values of $\theta$ be “left” (L), “center” (C), and “right” (R). A brief linguistic
rule base might then consist of the following logical implications:

1. IF [x is L and y is C] THEN [$\theta$ should be R]

2. IF [x is R and y is C] THEN [$\theta$ should be L]

3. IF [x is C and y is VF] THEN [$\theta$ should be C]

The relational matrix embodying these rules is shown in Figure 4, and is seen to be
a concise display of the relationship between the pairs of fuzzy values of the fuzzy input
variables and the fuzzy values of the fuzzy output variable. It is interesting to note that
the relational matrix is not necessarily full. An important and powerful aspect of fuzzy
control is that only those rules that are well known need be specified; the fuzzy calculations
will “interpolate” or “extrapolate” to fill in missing rules. The fuzzy calculations will also
resolve conflicting rules in an optimal way consistent with the specified linguistic rule base.

Figure 4: Relational matrix mapping fuzzy input variables x and y to fuzzy output variable $\theta$.
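
A sparse relational matrix of this kind can be sketched as a dictionary keyed by pairs of input values. The three entries follow the three example rules, reading the y value of the third rule as “very far” (VF); the row and column layout and all names are assumptions made for illustration, since the figure itself is not reproduced here.

```python
X_VALUES = ("L", "C", "R")     # left, center, right
Y_VALUES = ("C", "F", "VF")    # close, far, very far

# Sparse relational matrix: (x value, y value) -> steering value
RULES = {
    ("L", "C"): "R",
    ("R", "C"): "L",
    ("C", "VF"): "C",
}


def relational_matrix(rules):
    """Rows are indexed by y, columns by x; None marks a rule that was not specified."""
    return [[rules.get((x, y)) for x in X_VALUES] for y in Y_VALUES]
```

Only three of the nine cells are filled, which illustrates the remark that the matrix need not be full.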

3.3 Fuzzy Controller

Fuzzy  control  systems  inevitably  interact  with  the  physical  world  of  crisp  measurements 
and  actuators.  On  input  to  the  controller,  crisp  values  of  crisp  variables  are  converted  to 
fuzzy  values  of  fuzzy  variables  according  to  the  membership  function  of  the  fuzzy  variable. 
For example, in Figure 3 a crisp value of “4” of the crisp variable position would take
on two fuzzy values “close” and “far” of the fuzzy position variable. The membership
functions indicate that the crisp value “4” is the fuzzy value “close” with degree 0.21 and
the fuzzy value “far” with degree 0.64. Thus, a single measurement of a crisp variable may
activate a number of rules in the linguistic rule base or, equivalently, the relational matrix.
Each rule will operate on its fuzzy input variables, and their membership functions, to
produce a modified membership function, or fuzzy value, for the fuzzy output variable.
The specific form of the output membership function may be determined either by the
correlation-minimum or the correlation-product inferencing technique [13]. Since more
than one rule may be activated by a single measurement it follows, then, that a number
of fuzzy values of the output may also be generated. The output membership functions
generated by the firing of several rules may be combined in a number of different ways
to produce a single crisp output to activate a physical actuator. Two commonly used
methods are the mean-of-maxima and the centroid methods [13].

The fuzzy controller under development for the magnetic bearing has two fuzzy input
variables, position y and change in position dy, and one fuzzy output variable, actuator
voltage v. Each fuzzy variable may take on each of seven fuzzy values: “negative large”
(NL), “negative medium” (NM), “negative small” (NS), “zero” (ZE), “positive small”
(PS), “positive medium” (PM), and “positive large” (PL). The fuzzy values of the input
variables are shown over their corresponding universe of discourse in Figure 5. The universe
of discourse ranges from -5 volts (corresponding to a shaft position of -19 µm) to +5 volts
(corresponding to a position of +19 µm). Fuzzy values are trapezoidal in shape with a
maximum overlap of 25%, and are narrower near zero to provide finer control close to the
desired value.


Figure 5: Fuzzy values of input variables y and dy (abscissa: universe of discourse in volts;
ordinate: degree of membership).

Fuzzy values of the output variable are shown in Figure 6. They are triangular in
shape, have a maximum overlap of 25%, and are closer together near zero to provide finer
control. The exact shapes and locations of the fuzzy input and output variables are design
parameters whose optimal values are found by numerical experimentation.

Figure 6: Fuzzy values of the output variable v (abscissa: universe of discourse in volts;
ordinate: degree of membership).

The  7x7  relational  matrix  relating  the  fuzzy  input  pairs  to  fuzzy  values  of  the  output 
is  shown  in  Figure  7.  The  relationship  between  the  relational  matrix  and  the  corresponding 
set  of  forty  nine  IF-THEN  implications  is  obvious. 

The  correlation-minimum  inference  procedure  is  used  to  process  activated  rules  result- 
ing in  a truncation  of  the  output  membership  function  at  the  minimum  value  of  the  two 
input  membership  functions.  Note  that  since  a maximum  of  two  input  values  overlap,  a 
maximum  of  four  (as  opposed  to  a possible  maximum  of  forty  nine)  rules  can  be  activated 
at  once.  Combining  of  output  fuzzy  values  and  subsequent  defuzzification  is  performed 
using  the  centroid  method. 
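
The inference and defuzzification steps just described can be sketched end to end as follows. Several simplifications are assumed and are not the authors' design: the seven sets are evenly spaced triangles on [-5, 5] volts (the paper uses trapezoids that narrow near zero), the rule table is a hypothetical PD-like pattern because the 7x7 matrix of Figure 7 is not reproduced here, and clipped output sets are combined by addition before the centroid is taken on a sampled grid.

```python
LABELS = ["NL", "NM", "NS", "ZE", "PS", "PM", "PL"]
STEP = 10.0 / 6.0                                # spacing of set centers (assumption)
CENTERS = [-5.0 + STEP * k for k in range(7)]
EPS = 1e-12                                      # ignore round-off-sized memberships


def tri(x, center, half_width=STEP):
    """Triangular membership function centered at `center`."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


def fuzzify(x):
    """Return {set index: degree} for the (at most two) sets that x belongs to."""
    x = max(-5.0, min(5.0, x))
    degrees = {k: tri(x, c) for k, c in enumerate(CENTERS)}
    return {k: m for k, m in degrees.items() if m > EPS}


def rule(iy, idy):
    """Hypothetical rule table: the output opposes position and its change."""
    return max(0, min(6, 9 - iy - idy))


def infer(y, dy, samples=1001):
    """Correlation-minimum inference followed by centroid defuzzification."""
    grid = [-5.0 + 10.0 * k / (samples - 1) for k in range(samples)]
    combined = [0.0] * samples
    for iy, my in fuzzify(y).items():
        for idy, mdy in fuzzify(dy).items():
            weight = min(my, mdy)                # firing strength of this rule
            center = CENTERS[rule(iy, idy)]
            for k, v in enumerate(grid):
                combined[k] += min(weight, tri(v, center))   # truncate, then add
    area = sum(combined)
    return sum(v * m for v, m in zip(grid, combined)) / area if area else 0.0
```

Because each input falls in at most two sets, the nested loops visit at most four rules per sample, in line with the remark above about the number of rules activated at once.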


4 Performance of Magnetic Bearing with Fuzzy Controller

A linearized version of the nonlinear model presented in Section 2 was programmed using
Matlab to test the performance of the fuzzy controller. Figure 8 shows the response to a
3.8 µm (1 volt) step demand change in position. The figure shows that the fuzzy controller
was successful in stabilizing the bearing and that response time is short. Sampling
frequency was 10 kHz. Oscillations are small and can be further reduced by reducing the
size of the fuzzy sets representing zero error. Steady state error can be further reduced
by adding an integral mode to the controller. These results are not surprising since the
present fuzzy controller uses only position and velocity inputs and is essentially operating
as a proportional-plus-derivative controller. Additional work is being conducted to
correct these deficiencies. Several promising adaptive control policies are being investigated
including modifying the input fuzzy set sizes and overlap, the output fuzzy set centroids,
and the scaling gains $K_e$, $K_c$, and $K_v$. Best results were obtained with 25% set overlap
and $K_e = 1$, $K_c = 18$, and $K_v = 5$.


Figure  7:  Relational  matrix  for  the  magnetic  bearing  controller. 

5 Architecture  for  a Fuzzy  VLSI  Chip 

The  architecture  of  a fuzzy  VLSI  chip  is  outlined  in  Figure  9.  The  basic  fuzzy  control 
algorithm  is  contained  on  a single  chip.  Rules  are  downloaded  from  a host  computer  at 
start-up  and  can  be  modified  by  the  host  computer  later.  The  chip  is  of  the  all-digital 
type so off-chip A/D and D/A converters are required. The fuzzy control algorithm has
four  parts:  1)  input  calculations,  2)  input  membership  determination,  3)  rule  evaluation, 
and  4)  output  defuzzification  as  described  below. 

5.1  Input  Calculations 

The single input to the chip is the position error in volts. The current error er(n) and
the previous error er(n - 1) are each stored in separate registers. The current change in
error ch(n) = er(n) - er(n - 1) is computed and stored in a third register. The variables
ER and CH, used in membership function determination, are found by multiplying er(n)
by $K_e$ and ch(n) by $K_c$, respectively. The scaling gains $K_e$ and $K_c$ are downloaded from
the host computer and may be modified as required.
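
A software model of this input stage might look like the sketch below. The class and attribute names are illustrative, and initializing the previous-error register to zero is an assumption; the paper does not state the power-up value.

```python
class InputStage:
    """Models the error registers and the scaling by the gains Ke and Kc."""

    def __init__(self, k_e, k_c):
        self.k_e = k_e
        self.k_c = k_c
        self.er_prev = 0.0   # er(n - 1); zero at start-up is an assumption

    def step(self, er):
        """Accept the current error er(n) and return the scaled pair (ER, CH)."""
        ch = er - self.er_prev        # ch(n) = er(n) - er(n - 1)
        self.er_prev = er             # shift the register for the next sample
        return self.k_e * er, self.k_c * ch
```

For example, with the gains reported as giving the best results, `InputStage(1.0, 18.0)` turns successive errors of 0.5 V and 0.75 V into the pairs (0.5, 9.0) and (0.75, 4.5).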

5.2  Input  Membership  Determination 

The input membership determination is made by a table look-up. There are two look-up
tables, one for ER and one for CH. The output of the table look-up is the modified fit
vector  (^4,  mA)  tob).  Each  look  up  table  is  of  size  3 by  m by  n. 

5.3 Rule Evaluation

Four rules are evaluated for each input pair. Each evaluation finds the minimum of the
input fuzzy sets and the centroid of the output fuzzy set. A 4 by n register holds the minimum
input membership values and a 4 by m register holds the corresponding centroids.

5.4  Output  Defuzzification 

Defuzzification is done in three steps. First, the minimum membership value is multiplied
by the centroid for each of the rules activated. Second, these products are summed;
at the same time the minimum membership values are also summed.
Finally, the sum of the minimum membership-centroid products is divided by the sum of
the minimum memberships to produce the desired result. The result is then multiplied by
an output voltage scaling gain $K_v$.
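
The three steps amount to a membership-weighted average of the output centroids, as in the sketch below. The function name and the choice to return zero when no rule fires are assumptions made for illustration.

```python
def defuzzify(min_memberships, centroids, k_v):
    """Weighted average of rule centroids, scaled by the output gain Kv."""
    # Step 1: multiply each minimum membership value by its rule's centroid.
    products = [m * c for m, c in zip(min_memberships, centroids)]
    # Step 2: sum the products and, in parallel, the membership values.
    numerator = sum(products)
    denominator = sum(min_memberships)
    # Step 3: divide; returning 0.0 when no rule fired is an assumption.
    if denominator == 0.0:
        return 0.0
    return k_v * numerator / denominator
```

Only multiplications, additions, and a single division are needed, which suits a fixed four-rule datapath.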


6 Summary  and  Conclusions 

A mathematical  model  of  a magnetic  bearing  was  presented  and  was  used  to  develop  a 
computer  simulation  model  to  test  alternative  magnetic  bearing  control  systems.  A fuzzy 
control system for the magnetic bearing was developed and tested by computer simulation. Initial results show
that  the  fuzzy  controller  stabilizes  the  magnetic  bearing  and  produces  acceptable  steady- 
state  and  transient  behavior.  Further  research  is  being  conducted  to  optimize  the  fuzzy 
controller  and  to  develop  suitable  adaptive  algorithms.  Particular  emphasis  is  being  placed 
on  achieving  zero  steady-state  error  and  rejecting  acceleration  disturbances.  Performance 
comparisons  between  the  fuzzy  controller  and  a linear-quadratic-gaussian  regulator  are 
being  conducted.  A candidate  VLSI  chip  architecture  has  been  proposed  to  implement 
the  fuzzy  control  algorithm  and  provide  rapid  sampling  for  real-time  control.  VLSI-based 
fuzzy  control  appears  feasible  for  real-time  control  of  uncertain  nonlinear  systems  like  an 
active  magnetic  bearing. 


Direct Kinematics Solution Architectures
for Industrial Robot Manipulators:

Bit-Serial Versus Parallel

Abstract - This paper investigates a VLSI architecture for robot direct kinematic
computation suitable for industrial robot manipulators. The Denavit-Hartenberg
transformations are reviewed to exploit a proper processing element,
namely an augmented CORDIC. Specifically, two distinct implementations,
bit-serial and parallel, are elaborated on. The performance of each
scheme is analyzed with respect to the time to compute one location of the
end-effector of a 6-link manipulator, and the number of transistors required.


1 CORDIC  Techniques 


The matrix $A_j$ describing the jth link is proposed to be implemented via 4 CORDICs:
two in parallel for the w-axis operation, and another two in parallel for the x-axis. Since the
rotation and translation are disjoint from each other, the 4 CORDICs can be arranged as a
2-stage cascade [5].

Let the jth joint orientation vector be denoted by $p_j$, where $p_j = A_j p_{j-1}$. Consider an
intermediate vector $p_j^*$ between $p_j$ and $p_{j-1}$:

$$p_j = \mathrm{Trans}(w_{j-1}, d_j)\, \mathrm{Rot}(w_{j-1}, \theta_j)\, p_j^* \quad : \text{stage 1} \qquad (1)$$

$$p_j^* = \mathrm{Trans}(x_j, a_j)\, \mathrm{Rot}(x_j, \psi_j)\, p_{j-1} \quad : \text{stage 2} \qquad (2)$$

One set of transformations for each stage, i.e. Trans(w, d)Rot(w, $\theta$), is a block-diagonal
matrix and can be implemented orthogonally by two 2x2 matrix transformations. Note
that it is implementable through an augmented PE, rather than two different PEs, observing that
Trans(w, d) is a trivial operation. Then,


$$p_j = \begin{bmatrix} \mathrm{Rot}(w_{j-1}, \theta_j) & 0 \\ 0 & \mathrm{Trans}(w_{j-1}, d_j) \end{bmatrix} p_j^* \qquad (3)$$

$p_j$ (and, similarly, $p_j^*$) is decomposed into two blocks, e.g. the first two elements of $p_j$
become one vector $X_j$:

$$p_j = [x_j, y_j, w_j, 1]^T = [X_j^T, w_j, 1]^T \qquad (4)$$

where $w_j$ is the w-axis component of the vector $p_j$, and $X_j$ holds the x- and y-axis components
rotated by $\theta_j$. In a similar way, for $p_j^*$ we can choose a rotated vector of the y- and w-axis
components disjointly through axis shuffling. Finally, n consecutive pairs of rotation and translation
can be implemented via a 2n-stage cascade. We will name each stage a macro-PE
(or an augmented PE), which can be 2n-pipelined to compose an n-link computation
processor. So as not to differentiate the two sets of transformations, w-axis and x-axis
respectively, we employ index i in unified notation for a macro-PE: for a reference axis
$w_i$, there are a rotation of $\theta_i$ and a translation of $d_i$, with $X_{i-1} = (x_{i-1}, y_{i-1})$ for an input and
$X_i = (x_i, y_i)$ for an output.

Each macro-PE, including one Trans($w_i$, $d_i$) and one Rot($w_i$, $\theta_i$), can be implemented
as in Figure 1.a. A one-joint processor is shown in Figure 1.b. Finally, for a 6-joint system,
Figure 1.c shows a fully pipelined structure.
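
The two-stage structure of one link can be mimicked in software as below. This is a behavioral sketch only: it uses library sine and cosine rather than CORDIC iterations, it reads the two stages as a rotation and translation about the x-axis followed by a rotation and translation about the w-axis, and the function names and the symbol psi for the x-axis angle are assumptions.

```python
import math


def macro_pe(u, v, w, theta, d):
    """Rotate (u, v) by theta about the reference axis and shift the axis component w by d."""
    c, s = math.cos(theta), math.sin(theta)
    return c * u - s * v, s * u + c * v, w + d


def link_transform(p_prev, theta, d, a, psi):
    """Apply one link: the x-axis stage first, then the w-axis stage, with axis shuffling."""
    x, y, w = p_prev
    # Stage about x: rotate (y, w) by psi, translate x by a.
    y, w, x = macro_pe(y, w, x, psi, a)
    # Stage about w: rotate (x, y) by theta, translate w by d.
    x, y, w = macro_pe(x, y, w, theta, d)
    return x, y, w
```

Chaining six calls to `link_transform` corresponds to the twelve-stage pipeline for a 6-link manipulator; the same `macro_pe` routine serves both stages because only the assignment of coordinates to its inputs changes.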

From this point, we will concentrate on the implementation of a macro-PE. Observing
that the Rot and Trans functions are disjoint from each other, let us isolate the rotation part
first. This vector rotation of $X_j = (x_j, y_j)$ by the angle $\theta_j$ can be realized by an iterative
algorithm called CORDIC [4] instead of computing trigonometric functions and applying
matrix multiplication. CORDIC realizes a vector rotation by a partial sum of micro-angle
rotations with a pre-fixed sequence of angles. The rotation macro-angle is represented
as a sum of decomposed micro-angles $\theta_{i,k}$, i.e.

$$\mathrm{Rot}(\theta_i) = \prod_k k_k \begin{bmatrix} 1 & -\tan\theta_{i,k} \\ \tan\theta_{i,k} & 1 \end{bmatrix} \qquad (5)$$

where $k_k = \cos\theta_{i,k}$ is a micro-scale composing a final scale factor, explained later. Such
a specific form of the pre-fixed micro-angle sequence as $\tan^{-1} 2^{-k}$ is attractive for VLSI
implementation since it is composed only of additions, shifts, and an arctangent lookup
table. For simplicity of notation, the subscript i indexing a certain stage will be omitted,
and X, Y and Z stand for abridged notations for those having subscript i.

Non-redundant: The micro-iterations of the conventional (hereafter called
non-redundant) CORDIC are 3 linear recursive equations: the X-recurrence (X-rec.),
Y-recurrence (Y-rec.) and Z-recurrence (Z-rec.) [4].

$$X[i+1] = X[i] + \sigma_i 2^{-i} Y[i]$$

$$Y[i+1] = Y[i] - \sigma_i 2^{-i} X[i]$$

$$Z[i+1] = Z[i] - \sigma_i \tan^{-1} 2^{-i} \qquad (6)$$

With an initial value of $Z[0] = \theta_j$, CORDIC rotates the initial values X[0] and Y[0] to the
final values X[n] and Y[n], while driving Z[i] toward zero, so that Z[n] is forced to be zero.
With n iterations, n-bit accuracy of X and Y in the output can be achieved.

Figure 1: CORDIC-based Pipelined Architecture for Direct Kinematics Computation: a.
A macro-PE, one stage from an orientation to an intermediate; b. 2-stage cascade, an
$A_j$ transformation module for a link; c. A complete pipelined computation module for a
6-link system.

For a known angle, the direction of the rotation, $\sigma_i$, can be pre-computed or calculated one
by one on-the-fly using the following selection function.

$$\sigma_i = \begin{cases} 1 & \text{if } Z[i] \ge 0 \\ -1 & \text{if } Z[i] < 0 \end{cases} \qquad (7)$$

The CORDIC rotation does not preserve the input norm. To get a rotated vector having
the same length as the input (X[0], Y[0]), X[n] (Y[n]) needs to be compensated by a scaling
factor K

$$K = \frac{\| [X[n], Y[n]]^T \|}{\| [X[0], Y[0]]^T \|} = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2\, 2^{-2i}} \qquad (8)$$

where $\|\cdot\|$ stands for the norm of the vector. Note that K is constant for the non-redundant
scheme since $\sigma_i$ is in {-1, 1}.
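
The non-redundant recurrences, the sign-based selection of the rotation direction, and the scale compensation can be sketched in floating point as follows. This models the arithmetic only, not the bit-serial or parallel hardware; the function names and the default of 24 iterations are assumptions. With the sign convention of the recurrences as written, a positive angle turns the vector clockwise.

```python
import math


def cordic_rotate(x, y, theta, n=24):
    """Run n micro-rotations of the X, Y, Z recurrences; returns (X[n], Y[n], Z[n])."""
    z = theta
    for i in range(n):
        sigma = 1 if z >= 0 else -1          # direction chosen from the sign of Z[i]
        shift = 2.0 ** -i                    # a right shift by i bits in hardware
        x, y = x + sigma * shift * y, y - sigma * shift * x
        z -= sigma * math.atan(shift)        # arctangent table entry in hardware
    return x, y, z


def scale_factor(n=24):
    """K = product of sqrt(1 + 2^(-2i)); constant because sigma_i is always +1 or -1."""
    k = 1.0
    for i in range(n):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k
```

Dividing the outputs by `scale_factor(n)` restores the input norm; for example, `cordic_rotate(1.0, 0.0, 0.5)` followed by this division approximates (cos 0.5, -sin 0.5).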

Redundant: Non-redundant CORDIC is inherently slow, with a delay of $O(n^2)$ due to
its recursiveness and serial dependency, since a micro-rotation with delay $O(n)$ must be
finished before processing the next micro-rotation. The delay of a macro-rotation
(n micro-rotations) can be improved from $O(n^2)$ to $O(n)$ by using redundant arithmetic
(carry-free addition such as carry-save or signed-digit addition) and determining the direction
of the rotation $\sigma_i$ based on an estimate instead of an exact value [9]. The redundant
arithmetic gives a delay of $O(1)$ instead of $O(n)$, and the estimation of direction is necessary
so as not to erode the advantage of $O(1)$. This requires modification of the recurrences and
the selection function. This redundant CORDIC scheme produces the output about 4 times
faster than the non-redundant one. However, it introduces additional cost since the scale factor
K is variable, depending on the macro-angle, because $\sigma_i$ is allowed to be in {-1, 0, 1}.

Constant-Factor-Redundant: To reduce the implementation cost of redundant CORDIC,
it would be good to have a constant scale factor by forcing $\sigma_i$ to be in {-1, 1}. However, since $\sigma_i$
is determined from an estimate, a question of convergence assurance arises. A scheme
appending correcting iteration stages at proper positions has been proposed [10]. Along
the lines of this idea, the number of extra correcting iterations is further reduced by dividing the
micro-iterations (for $i = 0$ to $i = n - 1$) into two groups: one group where the direction of
the rotation is in {-1, 1} for $i = 0$ to $i = n/2$, and the other in {-1, 0, 1} for $i = (n + 1)/2$
to $i = n - 1$. This reduces the correcting iterations by 50% since a correcting iteration is not
needed for the second half of the micro-iterations, and we still obtain a constant scale factor K since the
value of K in n-bit precision does not depend on the $\sigma_i$ value for $(n + 1)/2 \le i \le (n - 1)$. The
Z-recurrence can also be modified so that $\sigma_i$ is determined quickly by looking at a few most
significant bits. This new scheme is called Constant-Factor-Redundant CORDIC
(CFR-CORDIC). The modified recurrences and selection functions for the scheme are described
below.


EBM-lfi  I cr22-" 
||[jr[o],y[o]]‘||  ~MV1+  4 ’ 


x[i  + 1]  = x[i\  + di2-’y[f] 

Y[i  + l]  = Y[i]  - d^-'Xfi] 
U[i  + 1]  = 2 (U[i)  - d.21'  tan"1  2~‘) 


(9) 


3rd  NASA  Symposium  on  VLSI  Design  1001 


6.2.5 


where  U[i]  is  for  the  implementation  simplicity,  which  is  equal  to  2*Z[i],  and  the  selection 
function  is  given  as  follows: 

' 1 if  U[t\  > 0 

or  U[i ] = 0 fl  i < n 
“I  0 U[i]  = 0 n i > n/2 
k -1  if  U[i]  < 0 

When t fractional bits are used in the estimate value, i.e., the estimate of U[i] is computed using t fractional bits of the redundant representation of U[i], the following correcting iterations need to be included, where the interval between the indexes of correcting iterations should be less than or equal to (t - 1), up to the last iteration index equal to n/2. When the correction stage is necessary at the jth step of micro-iteration,

$$U^c[j+1] = U[j+1] - 2\, d_j^c\, 2^{j} \tan^{-1} 2^{-j} \tag{11}$$

with the direction of the rotation $d_j^c$ determined from the same selection function as in eq. (10), except that it is decided based on U[j + 1] instead of U[i].
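The recurrences and the selection function above can be sketched in a few lines of Python. This is only a minimal floating-point sketch under stated assumptions: the exact value of U[i] is used in place of a t-bit estimate, so no correcting iterations are modelled, and the function names `cfr_cordic_rotate` and `scale_factor` are hypothetical.

```python
import math


def cfr_cordic_rotate(x0, y0, angle, n=24):
    """Run n micro-iterations of the X, Y and U recurrences.

    Assumption: U[i] is known exactly (no redundant estimate), so the
    correcting iterations of the CFR scheme are not needed here.
    """
    x, y = x0, y0
    u = angle  # U[0] = 2**0 * Z[0]
    for i in range(n):
        # Selection function: d is forced into {-1, 1} in the first half,
        # and may be 0 only in the second half when U[i] is exactly zero.
        if u > 0 or (u == 0 and i <= n // 2):
            d = 1
        elif u < 0:
            d = -1
        else:
            d = 0
        t = 2.0 ** -i
        x, y = x + d * t * y, y - d * t * x
        u = 2.0 * (u - d * (2.0 ** i) * math.atan(t))
    return x, y, u


def scale_factor(n=24):
    """Constant scale factor K when every d_i is +1 or -1."""
    k = 1.0
    for i in range(n):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k


x, y, _ = cfr_cordic_rotate(1.0, 0.0, 0.5)
k = scale_factor()
print(x / k, y / k)  # close to cos(0.5) and -sin(0.5)
```

With these sign conventions the input vector is rotated clockwise by the initial angle, and dividing by the constant K recovers the unit-norm result.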

So far, we have discussed the recursive structures of several CORDIC schemes for implementing the rotation part of the basic PE, as depicted in Figure 1. The PE, augmented by a translator, necessitates a scaling operation at each stage, because the shuffling of the output at each stage makes the continuous accumulation of the scaling factor too complex to be processed at the final stage. The scaling operation has been handled either explicitly or implicitly. The explicit way divides the rotated vector by a constant, which is known in advance for the non-redundant scheme and otherwise has to be calculated while running the micro-steps of CORDIC [4,9]. The division can be performed by another CORDIC (in linear mode) or by a divider. The implicit approach reconfigures the sequence of micro-iterations of the CORDIC so that the result eventually has a different norm from that obtained without scaling micro-iterations. Scaling micro-iterations generally aim at bringing the adjusted scaling factor to a power of two or to 1, which can easily be normalized to unit size. Each micro-iteration can be composed of i) reduction axis-scaling [11], ii) repetition of vector-scaling, iii) expansion axis-scaling, or combinations thereof [12]. Issues regarding the solution search need further study beyond the greedy method or the decomposed method [13]. In summary, explicit scaling almost doubles the system complexity, while implicit scaling increases it by 25 % for the non-redundant scheme and by about 30 % for the redundant scheme.

2 Application  to  Direct  Kinematics 

In this section, we design an architecture for the direct kinematics computation based on CFR-CORDIC. The data path is parallel. To analyze its performance, we define a new measure, namely the one-position calculation time. Using this measure, we also analyze the performance of a similarly implementable bit-serial architecture.

2.1 Performance Measure

Let us define the following parameters.

$b_i$ : the number of bits in each input x, y and w

$b_f$ : the number of bits in each output

$n_f$ : the number of links (= 6)

$f_c$ : the available data shift rate

$\Delta$ : the step time per micro CORDIC iteration

$f_i$ : the input bit rate

Additionally, we define a measure parameter $T_\Delta$,

$$T_\Delta = \text{step time } (\Delta) \times \text{number of steps},$$

to compare the performance of various schemes. For a discrete-element implementation, $\Delta$ corresponds to one single external clock time $1/f_c$. Note that $\Delta$ varies depending on the particular implementation of a macro-PE. Without loss of generality, let us define the unit of $\Delta$ to be 1 for a one-bit full addition time. The input processing rate can alternatively be interpreted as

$$\frac{f_i}{b_i} \le \frac{1}{T_\Delta}, \tag{12}$$

which limits the maximum rate of input vector sampling that can be processed by an implemented processor.


2.2  Performance  Comparison 

Bit Serial: A macro-PE using a serial data path and arithmetic units for CORDIC is shown in Figure 2 [6]. Figure 2.a shows the symmetric components of a bit-serial PE in the x, y and w representation, and Figure 2.b gives the detail of each block (X-recurrence or Y-recurrence) employing bit-serial arithmetic. The W-recurrence is in Figure 2.c, and the Z-recurrence in Figure 2.d. The x and y components of the input vector are taken initially as X[0] and Y[0], and the initial angle Z[0] is set to the corresponding joint angle. After performing n micro-iterations, CORDIC produces n-bit precision outputs leading to $X_n$.

In the serial scheme without macro-pipelining, denote the basic step time as $\Delta_1$, which is equivalent to $\Delta$. When one adder is used recursively $n_f$ times to process the $n_f$ links,

$$T_{\Delta_1} = \Delta_1 \cdot n_f\,\bigl(b_f + b_i\,(b_i + \log_2 b_i)\bigr),$$

where the output has a $b_f$-bit buffer.

CFR-Redundant Parallel: To increase the throughput of the previous scheme, the bit-serial PEs can be replaced by PEs using parallel arithmetic. When parallel arithmetic and non-redundant CORDIC are adopted, the corresponding parameter becomes

$$T_{\Delta_2} = \Delta_2 \cdot n_f\,(b_i + \log_2 b_i)$$

where $\Delta_2$ equals the time for one micro-rotation (the time for a variable shifter plus the time for a carry-propagate addition), approximately $2\log_2 b_i$ assuming a fast variable shifter and carry-propagate adder. The step time can be further shortened by adopting CFR-CORDIC, where a carry-free adder (signed-digit adder) replaces the carry-propagate adder. Figure 3.a shows a macro-PE in components, and Figure 3.b gives the detail of each block (X-recurrence or Y-recurrence) employing parallel/redundant arithmetic. The Z-recurrence is in Figure 3.c.

Figure 2: A bit-serial PE: a. A macro-PE with X-, Y- and W-recurrence, b. Detail of either block, c. W-recurrence, d. Z-recurrence.


Figure 3: A parallel/redundant PE: a. A macro-PE with X- and Y-recurrence, b. Detail of either block, c. Z-recurrence.

Description      Δ_i/Δ   T_Δ      Processing rate   TRs estimate
Bit-serial       1       1200Δ    600K              2K
  (parallel)                      4M                12K
Parallel (CFR)   5       500Δ     2M                6K
  (parallel)                      10M               40K

Table 1: Time and complexity comparison


In this case, the sign of Z[i] at the ith micro-iteration cannot be detected by looking at the most significant bit, since Z[i] is in a redundant number representation. To determine the sign of Z[i] quickly by looking at a few significant bits, CFR-CORDIC uses an estimate of the shifted Z[i], $\hat{U}[i]$, using t fractional bits. As discussed earlier, the number of fractional bits used for the estimate also determines the frequency of correcting iterations: the more fractional bits are used, the fewer correcting iterations are required. Let the number of correcting iterations be denoted by $\eta$. The corresponding $T_{\Delta_3}$ becomes

$$T_{\Delta_3} = \Delta_3 \cdot n_f\,(b_i + \log_2 b_i + \eta)$$

where $\Delta_3$ equals the time for a carry-free addition plus the maximum of the times for a selection function and a variable shifter, approximately $(1 + \log_2 b_i)$. Note that the practical number of correcting iterations is much smaller than $b_i$, e.g. 1 for 16-bit resolution. Hence, we can approximate $T_{\Delta_3}$ by that of the redundant scheme without a correcting iteration.

For the case $b_i = 12$, $b_f = 16$, the estimated $T_\Delta$ is summarized in Table 1. To get first-order estimates of the available speed and area, we use the figure that one full adder (also a one-bit shifter) requires approximately 50 TRs and one 20 nsec clock cycle [14].
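As a rough check, the two step-count formulas can be evaluated numerically. The sketch below assumes $b_i = 12$, $b_f = 16$, $n_f = 6$, one correcting iteration, and the step-time ratios 1 and 5 listed in Table 1; the function names are hypothetical.

```python
import math


def t_bit_serial(n_f, b_i, b_f, delta=1.0):
    """Bit-serial measure: delta * n_f * (b_f + b_i * (b_i + log2(b_i)))."""
    return delta * n_f * (b_f + b_i * (b_i + math.log2(b_i)))


def t_parallel_cfr(n_f, b_i, eta, delta=5.0):
    """Parallel CFR measure: delta * n_f * (b_i + log2(b_i) + eta)."""
    return delta * n_f * (b_i + math.log2(b_i) + eta)


serial = t_bit_serial(n_f=6, b_i=12, b_f=16)
cfr = t_parallel_cfr(n_f=6, b_i=12, eta=1)
print(round(serial), round(cfr))  # about 1218 and 498 units of the step time
```

Rounded to the nearest hundred, these values agree with the 1200Δ and 500Δ entries of Table 1.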


3 Conclusion 

We have examined various kinds of CORDIC schemes as a macro-PE module for the direct kinematics processor, and showed that their micro-level regularity is suitable for VLSI implementation. The schemes, depicted with specific schematics, include the conventional non-redundant, the redundant and the Constant-Factor-Redundant schemes. The cost-effectiveness of selected architectures has been analyzed using bit-serial, parallel or pipelined structures with respect to the time and the number of modules required to compute one location of the end-effector of a 6-link manipulator, given a set of angle measurements. The comparison table shows the CORDIC-based robotics processor to be a prospective VLSI solution for a wide range of kinematics calculation requirements, trading off size against speed.

Abstract- A design technique for microprocessors combining the simplicity of RISCs with the richer instruction sets of CISCs is presented. They utilize the pipelined instruction decode and datapaths common to RISCs. Instruction invariant data processing sequences which transparently support complex addressing modes permit the formulation of simple control circuitry. Compact implementations are possible since neither complicated controllers nor large register sets are required.


1 Introduction 

The design of microprocessors has evolved considerably since the introduction of the first microprocessor in 1971 [3]. Traditional microprocessors are extremely complicated machines that support hundreds of instructions and a dozen or so addressing modes. The dominance of such complex instruction set machines (CISC) has recently been challenged by much simpler processors which support only the most commonly used instructions. These processors are known as reduced instruction set computers (RISC) [8]. Traditionally, RISC processors:

• Support  a small  (reduced)  instruction  set  of  simple  instructions  which  represent  the 
most  commonly  used  operations, 

• Process  instructions  at  the  rate  of  one  instruction  per  system  clock  cycle, 

• Have  a large  on  board  register  file  for  instruction  or  data  cache. 

The  SPARC  architecture  is  a scalable  family  of  traditional  RISC  processors  which  was 
developed  commercially.  The  architecture  is  deeply  pipelined  and  depends  heavily  upon 
the  compiler  to  efficiently  map  the  register  set  and  avoid  forbidden  instruction  sequences 
[1].  Recently,  the  meaning  of  the  term  RISC  processor  has  become  quite  blurred  [6].  The 
IBM  System/6000  purports  to  be  a RISC  implementation,  but  supports  184  instructions 
[5],  which  is  a considerably  larger  instruction  set  than  that  of  the  68020  microprocessor 
[7],  which  is  generally  considered  to  be  a CISC  machine. 

The design methodology described here is targeted at applications which must be implemented in a small amount of circuitry (i.e. as a cell on a larger integrated circuit), while retaining medium to high levels of performance. The approach taken is to implement a system which adheres to most, but not all, of the design concepts of a traditional RISC machine. Key points of the design approach investigated are listed below. Each will be described in more detail later.

• The processor supports a very small (reduced) instruction set. Only vital or frequently used operations are supported directly.

• The instruction set is orthogonally partitioned. As nearly as possible, bit fields in instructions mean the same things for all instructions. All addressing modes are supported in the same manner for all instructions.

• All instructions are processed using invariant execution sequences. This means that information flows through the datapath in precisely the same manner for all instructions.

• Both the datapath and the associated controller are deeply pipelined. The use of invariant execution sequences permits the construction of very deep, yet simple processing pipelines.

• Only a small internal register set is supported. The processor registers are memory mapped, allowing them to be accessed and updated with general memory reference instructions.

• The support of relatively complex addressing modes is important if the internal register set is small. Implemented consistently across the entire instruction set, they add little to the overall complexity of the machine.

Though  a specific  processor  was  implemented,  the  design  methodology  followed  may 
be  used  to  implement  a large  number  of  different  RISC-like  processors,  each  with  different 
size-performance  trade-offs. 

2 Execution  Cycle  Pipeline 

The  data  flow  strategy  of  the  microprocessor  is  the  first  item  which  must  be  designed.  This 
includes  data  flow  to  and  from  memory  as  well  as  through  the  datapath  of  the  processor 
itself.  The  performance  required  of  the  processor  drives  the  choices  made  at  this  point. 
Different  cost-performance  ratios  can  be  achieved  through  the  use  of  different  data  flow 
strategies.  A few  of  the  possible  tradeoffs  are  listed  below: 

• Processor word size?

• Separate address and data busses used to access memory?

• Separate instruction fetch and program data stores?

• Separate address generation and data processing units?

• Multiple data processing units?

• Pipeline depth?

• Data/instruction cache?

• Number of internal data/address registers?

• Maximum number of instructions?

Figure 1: µP Execution Sequence

Since the design implemented was to have moderate performance yet be compact, it was decided to build a processor with a 16-bit word, separate address and memory busses, shared data and instruction stores, combined address generation and data processing units, a deep pipeline, no cache, and a small set (4) of general registers. It was further decided that the processor registers would be memory mapped, so that separate instructions would not have to be provided to access either the processor registers or the registers associated with the accompanying IO subsystem.

Though shallow pipelines such as that used in RISC II [11] are relatively simple to design, it was decided from a performance standpoint that the machine should be deeply pipelined. A deep pipeline permits the construction of a high-throughput processor, since
each  stage  of  the  pipe  can  operate  independently  on  different  portions  of  the  problem 
at  the  same  time.  Deep  pipelines,  however,  have  the  undesirable  characteristic  that  any 
irregularity  in  the  processing  sequence  for  different  instructions  can  lead  to  either  the  need 
for  extremely  complicated  locking  circuitry  [5,1]  or  else  the  definition  of  a large  number  of 
forbidden  instruction  sequences  to  prevent  data  collisions.  It  was,  therefore,  decided  that 
all  instructions  should  share  the  same  (though  perhaps,  a truncated)  processing  sequence. 
The  memory  execution  sequence  finally  decided  upon  is  shown  in  Figure  1.  Several  key 
points  of  this  processing  sequence  are: 

• Each  RAM  access  is  pipelined  two  clock  cycles  deep.  This  greatly  eases  all  timing 
paths  associated  with  RAM  accesses. 

• Data associated with an instruction “wraps” through the ALU/Adder twice: once to calculate the associated address and once to process data. This strategy keeps the datapath completely utilized at all times.

• Data  processed  during  one  instruction  is  available  for  subsequent  processing  on  the 
very  next  instruction. 

• One  instruction  is  executed  every  two  clock  cycles. 




Mode      Invoked              Description
Direct    s = 0                Effective Address is part of instruction.
Indexed   s = 1, Offset ≠ 0    Effective Address is contents of referenced stack pointer plus signed offset.
Stack     s = 1, Offset = 0    If the instruction implies a read, the referenced pointer is pre-incremented. The Effective Address is the new stack pointer contents. If the instruction implies a write, the Effective Address is the contents of the referenced stack pointer. The stack pointer is post-decremented.

Figure 2: µP Addressing Modes


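[Figure 3 diagram: a 16-bit instruction word (bit 15 at the left) with the fields Op Code, s, a/b, s/t and Address/Offset. Diagram labels: Register Select, StackRelative, StackSelect, 10 Bit Address, 11 Bit Address.]

Figure 3: µP Instruction Format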


3 Instruction  Types 

Once the data flow strategy has been determined, the instruction set and addressing modes of the processor must be selected. Here, a wide variety of possibilities presents itself. Since the processor implemented is intended for an interrupt driven environment, it was decided that the machine should be stack oriented and provide good support for stack based operations. The addressing modes summarized in Figure 2 were finally decided upon. Direct referencing of memory locations with a pointer contained in the instruction itself provides simple access to memory mapped registers, global variables and targets for normal jump and jump subroutine instructions. The indexed with offset mode provides support for jump tables, arrays, stack oriented local variable access, and subroutine argument passage. The auto-decrement and increment modes support implied push and pop operations as a part of any instruction, ease the placing of arguments on the stack for passage to subroutines, and allow the return from subroutine instruction to be implemented as a special case of the jump instruction.
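The three addressing modes of Figure 2 can be illustrated with a short sketch. It is an illustration only: the function name, the argument names and the use of unbounded Python integers (no 16-bit wraparound) are assumptions, and `s` stands for the stack-relative flag of the instruction.

```python
def effective_address(s, offset, pointer, is_read):
    """Return (effective_address, updated_pointer) for one memory reference.

    s        -- 0 selects direct mode, 1 selects stack-relative modes
    offset   -- address field (direct) or signed offset (stack relative)
    pointer  -- current contents of the referenced stack pointer
    is_read  -- True if the instruction implies a read, False for a write
    """
    if s == 0:
        # Direct: the effective address is part of the instruction.
        return offset, pointer
    if offset != 0:
        # Indexed: pointer plus signed offset, pointer unchanged.
        return pointer + offset, pointer
    if is_read:
        # Stack read (pop): pre-increment, then use the new pointer.
        pointer += 1
        return pointer, pointer
    # Stack write (push): use the pointer, then post-decrement it.
    return pointer, pointer - 1


push_addr, sp = effective_address(1, 0, 100, is_read=False)
pop_addr, sp = effective_address(1, 0, sp, is_read=True)
print(push_addr, pop_addr, sp)  # the pop reads the location just pushed
```

A write followed by a read through the same pointer touches the same location and restores the pointer, which is what lets push and pop ride along with any instruction.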

Figure 4 summarizes the instruction set which was selected. Each instruction can be operated in any addressing mode. The actual instruction format is shown in Figure 3.

Op Code   Mnemonic   Register   Description
0100      ld         a/b        Load Register
1111      st         a/b        Store Register
1000      jsr        -          Jump to Subroutine
0000      jmp        -          Absolute Jump
0010      and        a/b        Bitwise And
0011      or         a/b        Bitwise Or
1100      add        a/b        Addition
1110      sub        a/b        Subtraction
0110      not        a/b

Figure 4: Instruction Set


Since the arithmetic unit already provides an adder and a zero detect circuit for the implementation of the base instruction set, virtually no additional hardware in the datapath was required to implement the addressing modes. If a hardware multiplication instruction had been included in the instruction set, it would have been possible to utilize it during address generation to provide very sophisticated support for array accessing.

The requirement that all instructions be implemented with the same processing sequence places severe restrictions on the type of conditional statements that can be provided, however. A test and skip next instruction pattern was selected since it fits the required schema and was possible to implement without disturbing the pipelined flow of instructions. No retry of instructions is necessary, since the results of the test are always known in time to abort the effects of any subsequent instruction.

4 Implementation 

4.1  General 

The processor was designed using structured logic design techniques in a custom environment. High operating speeds and compact layouts were achieved through the extensive use of pass-logic.

The use of an orthogonally partitioned instruction set and an instruction invariant processing sequence resulted in extremely small and simple control circuitry. Consequently, the speed of a machine cycle is limited only by delays in the datapath, not by propagation delays in the controller.

The processor itself is a simple design. Approximately three man months were required to design the circuit and verify its logical correctness through extensive logic simulations. During this time a software model of the processor was also written to aid logic verification, and a macro-assembler was written for software development. Four man months were required to implement the layout and verify its correctness.

The processor was implemented in a 1.6 µm CMOS process and subsequently shrunk to a 1.0 µm process due to size considerations. It runs at a clock frequency of 28 MHz under worst case processing assumptions, 140 deg C junction temperature, and 4.1 V internal supplies. The processor was completely functional on first silicon. Under typical conditions, it should run at nearly 60 MHz. These clock rates correspond to instruction rates of 14 MIPS worst case and 30 MIPS typical. Currently, the limiting speed path is associated with the zero detect circuit. A redesign of this circuit would likely result in yet higher system performance.

4.2  Control 

The upper bits of the data bus are fed into the control section, where they are pipelined parallel to the data passing through the datapath. The instruction decode and control of the datapath is simple, since both control and data are pipelined in an equivalent manner. The individual control lines to the datapath are decoded directly from the pipelined instructions. The logic of the control section fits on one C sized sheet of logic. It consists of four stages of pipeline registers, 61 NAND gates (most of which are 2 to 4 input gates), 10 NOR gates, and various inverter/buffers. It contains no state-machines except those required for interrupts and memory cycle stealing by the IO subsystem, and these are exclusively single bit state-machines.


4.3  Data  Path 

The datapath consists of a register stack and a pipelined Adder and ALU. Figure 6 is a signal flow diagram of the datapath. The M Bus is used for all memory mapped data transfers. The P Bus drives the Address bus of the RAM through a clocked register, AR, located in the pads. The I bus is the data input bus from the memory, and the Q bus is the data bus to the RAM. The I and Q busses are combined into a single bi-directional bus at the chip pads through the MI and MO registers. (MI receives data from RAM during a read and MO outputs data to RAM during a write.)

The datapath operates as follows. The instruction fetch address is driven from either the program counter or the secondary program register onto the P (address) bus. Two clock cycles later, the instruction arrives on the I bus, where it is fed through the ALU. At this time the Op Code portion of the instruction is stripped off and the remaining bits are used to form either an absolute address or an offset for the stack relative mode of operation. The results are clocked into the Pipe registers. On the next clock cycle, this address/offset is either passed unaffected through the ADDER (absolute addressing mode) or added to the contents of the selected address register (SP or TP) (indirect addressing mode), and the results are driven onto the P (address) bus through a tri-state driver. If the instruction implies a write to memory, the appropriate data is driven onto the Q bus from either A, B or PC. If the instruction implies a read, the requested data enters the datapath via the I bus two clock cycles later and is processed by the ALU and ADDER in succession, at which time the results are loaded into the appropriate register. An RTL description of the data transfers comprising the data processing sequences utilized to implement the entire instruction set is shown in Figure 5. It should be noted again that though this processing sequence is seven clock cycles deep, processing of a new instruction starts every other clock cycle.
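The timing stated above (a processing sequence seven clock cycles deep, with a new instruction starting every other cycle) can be worked through with a small sketch. It assumes an idealized pipeline with no skipped instructions or cycle stealing, and all names are hypothetical.

```python
import math

ISSUE_INTERVAL = 2  # a new instruction starts every other clock cycle
PIPELINE_DEPTH = 7  # clock cycles in the full processing sequence


def completion_cycle(k):
    """Clock cycles elapsed when instruction k (k = 0, 1, ...) completes."""
    return k * ISSUE_INTERVAL + PIPELINE_DEPTH


def max_in_flight():
    """Largest number of instructions overlapped in the pipeline."""
    return math.ceil(PIPELINE_DEPTH / ISSUE_INTERVAL)


def instruction_rate_mips(clock_mhz):
    """Sustained instruction rate for a given clock frequency in MHz."""
    return clock_mhz / ISSUE_INTERVAL


print(completion_cycle(0), completion_cycle(3), max_in_flight())
print(instruction_rate_mips(28), instruction_rate_mips(60))
```

Under these assumptions up to four instructions overlap, and the 28 MHz and 60 MHz clock rates give the 14 MIPS and 30 MIPS figures quoted in Section 4.1.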

The ALU and adder are both implemented using pass logic. The ALU consists of a single cell replicated 16 times, each of which consists of only 23 n-channel pass gates and 9 inverter/buffers. The ALU performs all bitwise operations and provides a zero detect function which is used in the conditional skip instructions, as well as in the detection of the auto increment/decrement addressing modes. The Op Code (Figure 4) bit patterns were selected such that the upper bits of the instructions themselves become the control lines for the ALU with minimal remapping.
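A software model might look up the op code directly from the upper bits of the instruction word. The sketch below assumes the op code occupies the top four bits of the 16-bit word, which is an interpretation of Figure 3 rather than a stated fact, and it lists only the op codes that appear in Figure 4; the names are hypothetical.

```python
# Op codes as listed in Figure 4 (four-bit patterns).
OPCODES = {
    0b0100: "ld",
    0b1111: "st",
    0b1000: "jsr",
    0b0000: "jmp",
    0b0010: "and",
    0b0011: "or",
    0b1100: "add",
    0b1110: "sub",
    0b0110: "not",
}


def decode_mnemonic(word):
    """Return the mnemonic for a 16-bit instruction word, or None.

    Assumption: the op code sits in bits 15..12 of the word.
    """
    opcode = (word >> 12) & 0xF
    return OPCODES.get(opcode)


print(decode_mnemonic(0x4000), decode_mnemonic(0xF123))
```

Op codes that are not listed in Figure 4 simply decode to `None` in this sketch.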

The  configuration  selected  to  implement  the  ADDER  is  a modified  transmission  gate 
conditional  sum  scheme  [10].  The  configuration  is  small,  regular,  and  very  fast. 

4.4 IO Subsystem

Though not a primary topic here, it should be mentioned that a complete IO subsystem was implemented and integrated with the microprocessor described here. It consisted of a DMA subsystem which was responsible for the bulk transfer of data around the chip, two serial ports for low speed data transmission and acquisition, a parallel port for the transfer of data to and from an external microprocessor, and a prioritized interrupt/event passage system.

5 Conclusion 

Present day integrated circuit fabrication processes support levels of integration adequate for the construction of on-board microprocessor based controllers which occupy only a small portion of the available circuit area. Such processors can be readily designed for different cost-performance tradeoffs, as required for specific applications. The outlay of engineering time need not be excessive, and the use of high-level languages for code development makes the underlying instruction set transparent to the firmware developer and eases code migration, development and support.

Abstract- High throughput is an overriding factor dictating system performance. In this paper, a configurable data path processor is presented which can be modified to optimize performance for a wide class of problems. The new processor is specifically designed for arbitrary data path operations and can be dynamically reconfigured.

1 Introduction 

High performance computers are increasingly in demand in areas such as weather forecasting and structural analysis. These often require architectures which differ from the standard von Neumann machine, also called the Standard Stored Program Computer. Stored program computers are designed to be general purpose and are not optimized for any specific problem. Fully customized architectures can be optimized to achieve maximum performance for a specific problem, but such processors cannot usually be adapted to produce solutions to different problems.

Modern technology opens new dimensions to the designer of high performance systems by providing low cost VLSI modules which have high computational throughput. For a given functionality, there are two major dimensions of performance: delay and throughput. High throughput is the most critical factor in real time processing of massive amounts of data, for example in digital signal processing and database operations. Since general purpose parallel computers cannot offer real time processing speeds, special purpose computers become the only appealing alternatives.

Special purpose processors can be of two types: 1) Dedicated Processors and 2) Reconfigurable/Programmable Processors. While the former are characterized by high processing speeds, inflexibility, long design time and high design cost, the latter have the advantages of greater flexibility in coping with changes in the object problem and system specification, and greater design economy, at the cost of some reduction in throughput.

This  paper  presents  a general  purpose  accelerator  which  is  an  enhancement  over  [1], 
that  allows  a variety  of  data  path  configurations,  each  characterized  by  its  own  topology 
of  activated  interconnections  and  hence  applicable  to  a wide  range  of  applications. 

This configurable architecture combines the general purpose advantages of the stored program machine with the optimization of a fully customized architecture to achieve maximum performance for a broad class of problems. Every functional unit, data path and control structure can be individually optimized for a given algorithm. The architecture presented is capable of operating in parallel, pipelined or sequential modes. The user configures the data path through programming. The architecture can be altered during operation by reprogramming, can be initialized and fixed for dedicated processing, or can be attached to a host processor.

1 This research was supported (or partially supported) by NASA under Space Engineering Research Center Grant NAGW-1406.

The  reconfigurable  processor  differs  from  the  stored  program  computer  in  the  sense 
that  there  is  no  instruction  fetch-decode-execute  cycle.  Moreover,  an  operation  can  be 
executed  every  clock  pulse  in  every  data  path  element. 


2 Processor  Design 

The  data  path  and  the  control  structure  have  been  designed  to  allow  sequential,  pipelined 
or  parallel  operation.  The  processor  is  configured  as  a set  of  identical  data  path  elements 
with  an  overall  controller.  The  top  level  block  diagram  of  this  processor  is  shown  in  Figure 
1.  There  are  two  major  components:  the  data  path,  which  is  an  ALU-register  stack  to 
manipulate  the  data,  and  the  state  controller,  which  controls  the  register  stack.  The  actual 
hardware  configuration  of  the  data  path  is  specified  during  the  programming  of  the  State 
controller. 


2.1 Data Path

Each data path element is as shown in Figure 2. Let there be m data path elements, each n bits wide. Direct communication between data path elements is an essential feature for achieving pipelined or parallel operation. Therefore, to allow all possible register-to-register communications, the data path bus must be m x n bits wide. This complete connectivity results in flexible reconfigurability, but also limits the number of data path elements.

Each data path element consists of a Multiplier Accumulator (MAC) which multiplies
two eight bit numbers and also adds two sixteen bit numbers to the product (a·b + c + d).
This output is stored in a globally accessible register of the data path element. Also
contained within each data path element are two dedicated registers, which are used for
operations local to that data path element. The addition of these dedicated registers is
one of the improvements over [1]. This avoids the use of an entire data path element for
the purpose of storage only. Since the area of a data path element is constrained by the
m x n interconnect bus, the addition of these registers should have little impact on the
overall chip area.

Figure 2: Data Path Element

The data path also contains a set of ALU and selector units. The ALU can implement
an arbitrary arithmetic/logic operation. The operations of the first logic unit are shown
in Table 1. The selector unit selects the output of its respective multiplexors or the output
of the respective dedicated register as shown in Table 2. The m to 1 multiplexors can
select the output of any of the m globally accessible registers. The MAC operates on the
output of the selector unit and the logic unit to allow a mixture of arithmetic and logic
functions. Table 3 shows examples of ALU operations that can be performed. CI is the
carry-in data bit.

Each  globally  accessible  register  is  controlled  as  defined  in  Table  4.  The  dedicated 
registers  are  controlled  as  shown  in  Table  5. 

The  control  word  for  each  data  path  element  structure  is  shown  in  Figure  3.  For  16 
data  path  elements,  the  control  word  is  33  bits  wide. 


2.2 Control

The  state  controller  specifies  the  control  words  for  each  data  path  element.  The  hardware 
compiled  control  words  are  contained  in  a control  store  memory  as  depicted  in  Figure 
4.  The  output  of  each  word  from  the  control  store  drives  each  data  path  element.  A 
total  of  536  bits  are  needed  in  each  control  store  word  to  control  the  data  path  elements 
in  a 16  element,  16-bit  data  path  structure.  Program  control  within  the  control  store  is 
implemented  with  a program  location  counter.  The  control  store  can  be  of  an  arbitrary 
depth;  here,  it  is  depicted  as  256  words  deep.  To  perform  a jump  within  the  control  store, 
an  8-bit  jump  address  is  provided  in  each  control  store  word  as  depicted  in  Table  6. 

The  control  store  must  be  specified  prior  to  operation.  This  specification  (hardware 
compilation)  can  be  achieved  through  the  input  port,  16  bits  at  a time.  After  the  control 
store  is  specified,  the  processor  is  ready  to  operate  in  real  time. 
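
The control store figures quoted above can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the transfer count assumes that control words are packed back to back through the 16-bit input port, which the text does not specify.

```python
import math

def control_store_word_bits(num_elements, element_word_bits, jump_address_bits):
    """Width of one control store word: one control word per element plus a jump address."""
    return num_elements * element_word_bits + jump_address_bits

def load_transfers(depth_words, word_bits, port_bits=16):
    """Input-port transfers needed to fill the store (assumes tightly packed words)."""
    return math.ceil(depth_words * word_bits / port_bits)

# 16 elements, 33 control bits each, plus an 8-bit jump address for a 256-word store
word_bits = control_store_word_bits(num_elements=16, element_word_bits=33,
                                    jump_address_bits=8)
transfers = load_transfers(depth_words=256, word_bits=word_bits)
print(word_bits, transfers)
```

The 8-bit jump address follows from the depicted depth, since 256 words need log2(256) = 8 address bits; a deeper store would need a wider jump field.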


3 Operation 

The control store word defines the operation and the source of data (registers) for each
data path element. The output of any pair of registers Ri and Rj, i, j = 0, 1, 2, ..., 15, can be
input to a data path element. In general, the operation can be specified as

    Ri [ALU operation] Rj → Rk    (1)

which means that the result of an ALU operation upon the contents of any register pair
Ri and Rj can be placed into register Rk. This is true for any and all registers in the data
path and all operations occur simultaneously. Since each data path element can function as
an independent element, the entire data path can be configured to operate in the sequential,
pipelined or parallel modes. The controller also specifies the next state of the controller
and provides handshaking for external input and output control functions. The memory
can be ROM for dedicated processing or RAM or EPROM where field programmability
is desired. Depicted in Figure 4 is a feature where the control store can be programmed
via the input data port. The entire control store can be initialized in a 16 bit word serial
manner.

Table 6: Control Store Word

The  control  store  is  specified  prior  to  operation.  Once  the  control  store  is  specified,  the 
processor  executes  at  the  rate  specified  by  the  system  clock.  With  static  cells,  the  system 
clock  can  range  from  d.c.  to  the  maximum  allowable  by  the  IC  process. 

3.1  Examples 

Consider the following digital filter examples to illustrate the use of this processor. The
general second order difference equation is

$$y(n) = \alpha_0 x(n) + \alpha_1 x(n-1) + \alpha_2 x(n-2) - \beta_1 y(n-1) - \beta_2 y(n-2). \qquad (2)$$

This implements an IIR filter. For an FIR filter the equation simplifies to

$$y(n) = \alpha_0 x(n) + \alpha_1 x(n-1) + \alpha_2 x(n-2). \qquad (3)$$

To implement the FIR filter in the architecture presented in this paper, let Dr61 contain
α0, Dr51 contain α1 and Dr41 contain α2, as shown in the simplified block diagram of Figure
5. Also let R4, R5, R6 and R0 be initially reset. The operations can be described in a
register transfer language where each Pi is a control state that defines the data transfers
that take place when Pi is active.

    P0: Data → R0
    P1: R0·Dr61 + R5 → R6,  R0·Dr51 + R4 → R5,
        R0·Dr41 → R4

Assuming that constants are preloaded into the registers and that 2's complement
arithmetic is used, the control word for each data path element (Ri) is shown in Table 7.
Each control state Pi represents one parallel control word; the portion of the control word
for each Ri is shown on a series of lines for the sake of simplicity. For all other registers
not shown in Table 7, the register control bits in their control words are 00, indicating
no operation, for that control state.

There are a total of 5 operations that occur in 2 clock pulses. If this processor operated
at 20 MHz, 50 million operations per second would be performed.
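
The FIR register transfers can be checked with a short simulation. The Python sketch below models each P1 state as a simultaneous update of R4, R5 and R6 and compares the result with equation (3) evaluated directly. The function names, the integer test data, and the reading of the P1 transfers (R0·Dr61 + R5 → R6, R0·Dr51 + R4 → R5, R0·Dr41 → R4) are assumptions made for illustration; word-length effects of the multiplier are ignored.

```python
def fir_direct(x, a0, a1, a2):
    """Evaluate y(n) = a0*x(n) + a1*x(n-1) + a2*x(n-2) with zero initial history."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a0 * x[n] + a1 * x1 + a2 * x2)
    return y

def fir_register_transfer(x, a0, a1, a2):
    """Two control states per sample: P0 loads R0, P1 updates R6, R5 and R4 together."""
    dr61, dr51, dr41 = a0, a1, a2      # coefficients held in the dedicated registers
    r4 = r5 = r6 = 0                   # registers initially reset
    y = []
    for sample in x:
        r0 = sample                    # P0: Data -> R0
        # P1: every right-hand side uses the old register values, as in the hardware
        r6, r5, r4 = r0 * dr61 + r5, r0 * dr51 + r4, r0 * dr41
        y.append(r6)
    return y

def operations_per_second(operations, clock_pulses, clock_hz):
    """Throughput when `operations` complete every `clock_pulses` clocks."""
    return operations * clock_hz / clock_pulses

samples = [1, 2, 3, 4, -5]
assert fir_register_transfer(samples, 2, 3, 5) == fir_direct(samples, 2, 3, 5)
print(operations_per_second(5, 2, 20e6))  # 50000000.0
```

The tuple assignment in P1 evaluates all three right-hand sides before any register changes, which is the software analogue of all data path elements being clocked together.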

Table 8: IIR Filter Control Word Programming

    State  Reg   MuxA  MuxB  MuxC  MuxD
    P0     R0    A     -     -     -
    P1     R2    R0    -     R6    -
           R6    R0    -     R7    -
           R7    R0    -     -     -
           R10   R9    -     -     -
           R9    R9    -     R2    R8
           R8    R9    -     -     -

    State  Reg   ALU1  ALU2  SC1  SC2  CI  RC
    P0     R0    1111  0000  00   11   0   01
    P1     R2    0011  0000  10   00   0   01
           R6    0011  0000  10   00   0   01
           R7    0011  0000  10   11   0   01
           R10   1111  0000  00   11   0   01
           R9    0011  0101  10   00   0   01
           R8    0011  0000  10   11   0   01


For an IIR filter, consider the following register assignment.

    Register   Contents
    Dr21       α0
    Dr61       α1
    Dr71       α2
    Dr91       β1
    Dr81       β2
    R10        y(n)
    R9         y(n-1)
    R8         y(n-2)

A register transfer language description of the operations to implement the IIR filter
equation would be

    P0: Data → R0
    P1: R0·Dr21 + R6 → R2,  R0·Dr61 + R7 → R6,
        R0·Dr71 → R7,  Dr81·R9 → R8,
        R2 + R8 + Dr91·R9 → R9,  R9 → R10.

Assuming again that constants are preloaded into the registers as 2's complement
numbers and that 2's complement arithmetic is used, the control word for each data path
element is shown in Table 8. There are a total of 9 operations that occur in 2 clock pulses;
operating at 20 MHz, 90 million operations per second would be performed.
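
The IIR transfers can be exercised in the same way as a sanity check. The Python sketch below is a reading of the register transfer description, not the authors' code. It assumes the P1 transfers are R0·Dr21 + R6 → R2, R0·Dr61 + R7 → R6, R0·Dr71 → R7, Dr81·R9 → R8, R2 + R8 + Dr91·R9 → R9 and R9 → R10, and that the feedback coefficients are preloaded with their signs folded in (Dr91 = -β1, Dr81 = -β2), since the transfers only add. Function names and test data are hypothetical.

```python
def iir_direct(x, a0, a1, a2, b1, b2):
    """y(n) = a0*x(n) + a1*x(n-1) + a2*x(n-2) - b1*y(n-1) - b2*y(n-2), zero history."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y1 = y[n - 1] if n >= 1 else 0
        y2 = y[n - 2] if n >= 2 else 0
        y.append(a0 * x[n] + a1 * x1 + a2 * x2 - b1 * y1 - b2 * y2)
    return y

def iir_register_transfer(x, a0, a1, a2, b1, b2):
    """Return the value held in R9 after each P0/P1 pair."""
    dr21, dr61, dr71 = a0, a1, a2
    dr91, dr81 = -b1, -b2              # assumption: negated feedback coefficients
    r2 = r6 = r7 = r8 = r9 = r10 = 0   # all registers initially reset
    r9_trace = []
    for sample in x:
        r0 = sample                    # P0: Data -> R0
        # P1: six transfers, all computed from the old register contents
        r2, r6, r7, r8, r9, r10 = (
            r0 * dr21 + r6,
            r0 * dr61 + r7,
            r0 * dr71,
            dr81 * r9,
            r2 + r8 + dr91 * r9,
            r9,
        )
        r9_trace.append(r9)
    return r9_trace

x = [1, 2, -3, 4, 0, 5]
trace = iir_register_transfer(x, 1, 2, 1, 1, -2)
expected = iir_direct(x, 1, 2, 1, 1, -2)
# Under these assumptions R9 carries the filter output one sample period late.
assert trace[0] == 0 and trace[1:] == expected[:-1]
```

Under this reading the feed-forward sum is formed in R2 during one P1 state and combined with the feedback terms in the next, so the data path behaves as a two-stage pipeline while still matching equation (2).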


4 Summary 

A new  architecture  has  been  presented  which  allows  for  sequential,  pipelined,  or  parallel 
operation.  A control-data  path  structure  consists  of  m identical  data  path  elements.  The 
data  path  elements  can  be  independently  specified  to  allow  parallel  or  pipelined  operation. 
The  control  of  the  data  path  is  specified  by  the  control  store  memory.  The  processor  can  be 
a dedicated stand alone machine or attached to a general purpose processor. As an attached
processor, it can be dynamically modified to assume different data path configurations if the
control store is RAM based. It is proposed that this architecture is a first step in producing
a processor that allows the digital designer the same kind of flexibility in altering data path
configurations that field programmable gate arrays offer the logic designer.

Acknowledgement  This research was supported in part by NASA under grant NAGW-1406
and grant NAG5-1043.


1 Introduction

Measuring  similarities  between  large  sequences  of  genetic  information  is  a formidable  task 
requiring  enormous  amounts  of  computer  time.  Geneticists  claim  that  nearly  two  months 
of  CRAY-2  time  are  required  to  run  a single  comparison  of  the  known  database  against 
the new bases that will be found this year, and more than a CRAY-2 year for next year's
genetic  discoveries,  and  so  on. 

The  DNA  IC,  designed  at  HP-ICBD  in  cooperation  with  the  California  Institute  of 
Technology  and  the  Jet  Propulsion  Laboratory,  is  being  implemented  in  order  to  move 
the  task  of  genetic  comparison  onto  workstations  and  personal  computers,  while  vastly 
improving  performance. 

The  chip  is  a systolic  (pumped)  array  comprised  of  16  processors,  control  logic,  and 
global RAM, totaling 400,000 FETs. At 12 MHz, each chip performs 2.7 billion 16 bit
operations  per  second.  Using  35  of  these  chips  in  series  on  one  PC  board  (performing 
nearly  100  billion  operations  per  second),  a sequence  of  560  bases  can  be  compared  against 
the  eventual  total  genome  of  3 billion  bases,  in  minutes  — on  a personal  computer. 
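
The board-level figures follow directly from the per-chip numbers. The short Python sketch below reproduces them; the constant and function names are illustrative, and the per-processor figure is simply derived from the quoted totals rather than stated in the text.

```python
PROCESSORS_PER_CHIP = 16
CHIP_OPS_PER_SECOND = 2.7e9   # quoted 16 bit operations per second per chip
CLOCK_HZ = 12e6

def board_figures(num_chips):
    """Processors (one per primary-string base) and aggregate throughput for a board."""
    processors = num_chips * PROCESSORS_PER_CHIP
    ops_per_second = num_chips * CHIP_OPS_PER_SECOND
    return processors, ops_per_second

processors, ops = board_figures(35)
ops_per_processor_per_clock = CHIP_OPS_PER_SECOND / (CLOCK_HZ * PROCESSORS_PER_CHIP)
print(processors, ops / 1e9, ops_per_processor_per_clock)
```

With 35 chips this gives 560 processors, matching the 560-base sequence, and 94.5 billion operations per second, which the text rounds to nearly 100 billion.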

While  the  designed  purpose  of  the  DNA  chip  is  for  genetic  research,  other  disciplines 
requiring  similarity  measurements  between  strings  of  7 bit  encoded  data  could  make  use 
of  this  chip  as  well.  Cryptography  and  speech  recognition  are  two  examples. 

A mix of full custom design and standard cells, in CMOS34, was used to achieve these
goals.  Innovative  test  methods  were  developed  to  enhance  controllability  and  observability 
in  the  array.  This  paper  describes  these  techniques  as  well  as  the  chip’s  functionality. 

This  chip  was  designed  in  the  1989-90  timeframe. 


2 Goals 

The  main  project  goal  was  to  produce  a device,  for  a larger  system,  that  would  prove 
the  new  computing  architecture.  This  meant  integrating  as  much  functionality  as  was 
reasonable,  with  respect  to  cost.  This  includes  as  many  processors,  as  much  RAM  per 
processor,  and  as  many  other  desired  functions  as  possible.  Performance  was  a lesser 
concern,  largely  due  to  disk  access  being  the  initial  system  performance  limiter,  and  also 
because  the  architecture  provides  the  main  performance  breakthrough.  Limiting  power 
dissipation  was  a lesser,  but  real  concern  as  well. 

At the outset of the project, a standard cell implementation was envisioned that might
contain 10 processors, each with 32 bytes of RAM, on a 1 cm square device. In the end, a
custom solution provided 16 processors, each with double the functionality and 128 bytes
of RAM, on a roughly 1 cm x 1.2 cm device.


3 Functionality 

The primary function of the system, composed largely of a series of DNA chips, is to
locate regions of similarity between strings of genetic bases, represented in ASCII by the
characters A, C, G, and T. A terse description of the method for achieving this end is as
follows.

First, the primary string is converted by an external processor, such that each character
in that string is replaced by four match scores, one for each of the four possible characters
that it may be compared against in the secondary string. These scores, for each character
in the primary string, are loaded into the local RAMs of each successive processor, such
that each RAM contains the four bytes representing the four possible scores caused by
interaction of characters in the secondary string with that single character of the primary
string. Each processor's RAM is 128 bytes, enough to accommodate full ASCII. Now, each
processor behaves as the agent of one character in the primary string; hence, the length of
the primary string is initially limited to the number of processors in the system (16 x
number of DNA chips). Through software the length can be expanded without limit, by
partitioning the string and using sufficient overlap.

Secondly, a number of constants are loaded into each chip by the external system
processor, such as chip location within the pipeline, how to deal with gaps that naturally
occur within genetic sequences, and others.

At  this  point,  the  pipeline  begins  to  function.  The  secondary  string  enters  the  front  of 
the pipeline and is passed from one processor to the next on each successive clock. Each
character  within  this  string  is  used  as  an  address  to  the  local  RAM  of  the  current  processor 
visited  by  that  individual  character.  By  this  method,  the  appropriate  score  is  retrieved 
from  the  local  RAM  for  the  interaction  between  the  characters  of  the  two  separate  strings. 
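
The score lookup described above can be sketched in a few lines of Python. This is an illustration only: the match and mismatch values, the example string, and all function names are assumptions, and the real chip stores byte-wide scores supplied by the external processor.

```python
ALPHABET = "ACGT"
RAM_BYTES = 128  # one entry per 7-bit ASCII code

def build_score_ram(primary_char, match=2, mismatch=-1):
    """Score table for the processor that represents one primary-string character."""
    ram = [0] * RAM_BYTES
    for c in ALPHABET:
        ram[ord(c)] = match if c == primary_char else mismatch
    return ram

def load_pipeline(primary):
    """One local RAM per processor, in pipeline order."""
    return [build_score_ram(ch) for ch in primary]

def lookup(ram, secondary_char):
    """The secondary-string character is used directly as the RAM address."""
    return ram[ord(secondary_char) & 0x7F]

rams = load_pipeline("GATTACA")
print(lookup(rams[0], "G"), lookup(rams[0], "T"))  # 2 -1
```

Precomputing the four scores per primary character turns the comparison into a single RAM read per clock, so the scoring scheme can be changed without touching the processor logic.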


At the same time, the following three equations are processed within that same clock
cycle, in each processor (Smith and Waterman, Best Subsequence Alignments Algorithm):

$$H_{i,j} = \max\{0,\; H_{i-1,j-1} + s(a_i, b_j),\; E_{i,j},\; F_{i,j}\}$$

with

$$E_{i,j} = \max\{H_{i,j-1} - (u_E + v_E),\; E_{i,j-1} - v_E\}$$

$$F_{i,j} = \max\{H_{i-1,j} - (u_F + v_F),\; F_{i-1,j} - v_F\}$$

where F, H and b are pipelined, E is fed back within the processor, u_E, v_E, u_F and
v_F are constants dealing with sequence gaps, and s(a_i, b_j) is the score produced by the
intersection of the two characters from the different strings.
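
A software reference for this recurrence is useful when checking the hardware results. The Python sketch below assumes the usual affine-gap form, H(i,j) = max{0, H(i-1,j-1) + s(a_i, b_j), E(i,j), F(i,j)}, with E fed from column j-1 and F from row i-1; the scoring function, gap constants and function names are illustrative assumptions. It fills the whole table sequentially, whereas the chip evaluates one anti-diagonal per clock.

```python
def smith_waterman_affine(a, b, score, u_e, v_e, u_f, v_f):
    """Return (peak H, i, j) for primary string a and secondary string b (1-based i, j)."""
    m, n = len(a), len(b)
    neg_inf = float("-inf")
    H = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    F = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    best = (0, 0, 0)
    for i in range(1, m + 1):          # i plays the role of the processor index
        for j in range(1, n + 1):      # j is the position in the secondary string
            E[i][j] = max(H[i][j - 1] - (u_e + v_e), E[i][j - 1] - v_e)
            F[i][j] = max(H[i - 1][j] - (u_f + v_f), F[i - 1][j] - v_f)
            H[i][j] = max(0, H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          E[i][j], F[i][j])
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best

def simple_score(p, q):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if p == q else -1

print(smith_waterman_affine("ACGT", "TTACGTTT", simple_score, 2, 1, 2, 1))  # (8, 4, 6)
```

The returned position corresponds to the peak value and location that each processor reports when its H value exceeds the programmed threshold.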

Additionally, each processor monitors its H value, which represents quantitatively the
similarity between strings, and detects its peak. If this peak exceeds a programmed
threshold, this value, as well as the location of this occurrence within the secondary string,
is piped along through the remaining processors on the given chip and then stored in the
chip's global FIFO RAM. The range of this location value limits the secondary string to
4 million characters. However, with the use of external software, an unlimited string can be
applied.

This process occurs simultaneously in all processors on each chip, until the entire
secondary string has been piped all the way through, or the external system processor
interrupts. The equations and peak detector are implemented with five adders and seven
comparators; the values (u_E + v_E) and (u_F + v_F) are provided as constants.

When a value has been stored in the global FIFO RAM, the chip signals the external
system processor, which, at its convenience, reads that data from the chip
into a global system RAM. This is the raw similarity information desired from the system.
Of course, if any chip's FIFO nears overflow, a system interrupt is issued by that chip to
pause the entire pipeline until the RAMs can be emptied.

4 Design  Challenges 

Technical challenges included performance, power, and density concerns, as well as
problems pertaining to pad switching noise and testability.

By custom designing most circuitry for near maximum density, lower power and higher
performance fell out as by-products. Most N channel devices in the pipeline were sized at
5 μ wide and 1 μ long. The small devices reduced power consumption and greatly
improved the circuit density. By careful floorplanning to minimize interconnect
capacitance, chip performance was improved over that of a standard cell approach. One of the
key sub-modules within the processor is a 16 bit adder. At 426 μ by 215 μ (896 FETs),
the custom adder is one seventh the size of its standard cell implementation; at 4 mW, its
power consumption is one sixth; and at 11 ns, its performance is improved by more than
two fold over the standard cell solution.

While the conservative design goal of 12 MHz does not seem worthy of CMOS34,
consider two of the paths to be traversed in the 83 ns cycle: 1) Register → 16 bit signed
addition → 6 x (16 bit signed compare and select greatest) → 5 gates → Register, and
2) Register → address RAM → 16 bit signed addition → 3 x (16 bit signed compare and
select greatest) → 5 gates → Register.

The next area of concern was pad switching noise. This resulted from being bound
to a 208 lead Quad Flat Pack, with 190 signal pins, leaving only 18 power pads. While
having a fully synchronous design helped in some aspects, it also created the possibility
of having all 65 pipeline outputs and all 32 global data bus pads switch simultaneously.
Additionally, several other system pads could be switching as well. It is helpful that all
pipeline input signals are latched on the rising clk, while the pipeline outputs do not change
until a number of gate delays later. However, several volts of supply noise could easily be
generated by switching the standard pads, causing erroneous inputs and outputs on the
global system pads. Additionally, latchup was a concern.

The solution was to create three modified pads and use an expanded power distribution
scheme. All pad input sensors (TTL level Schmitt) were connected to one Vdd pad and
two Gnd pads; 124 inputs total. The output drivers for the pipeline outputs and the global
data bus were connected to 4 Vdd pads and 5 Gnd pads; 97 outputs total. The remaining 4
global system pads, capable of causing system interrupts, were connected to their own
isolated pair of power pads. Lastly, the chip core and output pad stage-ups were placed
on two pairs of power pads.

While this helped to isolate the noisy circuitry from sensitive circuitry, the noise spikes
on the dirty power bus from output switching were still too high. Several things were done
to help reduce the noise. First, the drivers for the 65 pipeline outputs were greatly reduced
so that the rise time on the Sentry 15's 60 pF load would be 40 ns worst case. These outputs
will normally see only 7-10 pF in the product, as the output pad communicates only with
the neighboring chip's input pad.

The global data bus pads created another problem in that their loading depended
directly on how many DNA chips were placed in the system, as they all connect directly
to one another. In the initial system, this load would be 275 pF. Since the 32 data bus pads
were by far the largest contributor to noise, and because their load could vary, another
scheme was employed. The data pads each contain two sets of output drivers, one small
and one large. A signal to the pad determines whether the large drivers are used in parallel
with the small ones, or whether the small drivers are used alone. A control register bit is
used to turn off the larger drivers, in the event that the data bus has a small capacitive
loading, or that the noise from the larger drivers is simply unacceptable (in which case
the system clk rate would have to be reduced). The rise time for a 275 pF load with the
large drivers is 20 ns, worst case, and 75 ns without those drivers.

Additionally, care was taken to turn on output drivers slowly, with about a 3 ns to 4 ns
rise time on the driver's gate. Skewing of data to the pad drivers also helped to reduce
the switching noise.

The last major challenge was in the area of test. Standard methods for testing the part
in its normal operating mode were seen to be near impossible. The controllability and
observability of nodes deep in the pipeline of 16 processors was very near zero. Since each
processor interfaces to the previous processor through a register bank, scan testing seemed
to be the obvious solution. However, with about 150 register bits in each processor, and
a total of 16 processors and one additional pipeline register buffer, the full scan vector
length would be over 2500 bits. With a Sentry 15 limited to 256k total vectors, this would
provide only 100 scan vectors, with no vector memory available for testing the 22k bytes
of RAM, nor the control logic. Several thousand scan vectors were desired for testing the
processors.

The solution was to take advantage of the fact that all of the processors are identical,
and therefore, given the same input scan vector, will produce the exact same resultant
output vector in the register bank of the pipeline's next stage. The method, then, is
to scan in a vector that is only one processor register bank long (150 bits) into all 16
processors simultaneously. After clocking the device once in normal mode, as in standard
scan methodology, the 16 resultant vectors are scanned out of the processors and onto
16 independent lines of the global data bus, readable from the chip's data bus pads.
Additionally, for testing in the product, all 16 scan outputs are connected, on chip, to an
equality function. If, when in test mode, the 16 scan outputs are not all equal, then
an error pin is activated for notification of the external system processor. The system
processor can then set another pin on the errant chip so that the pipeline data coming on
chip is diverted around the 16 processors to the final buffer register, thereby fixing the
whole pipeline at the cost of those 16 processors.
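
The benefit of the parallel scan scheme can be put in numbers. The Python sketch below is only an estimate: it assumes "256k" means 262,144 tester vectors and that one tester vector is consumed per scan shift, and the function names are hypothetical. The second function models the on-chip equality check over the 16 scan outputs.

```python
TESTER_VECTOR_DEPTH = 256 * 1024   # assumption: "256k" taken as 262,144 vectors

def scan_vectors_available(bits_per_scan_vector, depth=TESTER_VECTOR_DEPTH):
    """Number of complete scan vectors that fit if each shifted bit costs one tester vector."""
    return depth // bits_per_scan_vector

def scan_outputs_agree(outputs):
    """Equality function: true when every processor returned the same result vector."""
    return all(o == outputs[0] for o in outputs)

print(scan_vectors_available(2500))  # full serial chain: 104
print(scan_vectors_available(150))   # one register bank, broadcast to all 16: 1747

good = [0b1011] * 16
bad = good[:5] + [0b1001] + good[6:]
print(scan_outputs_agree(good), scan_outputs_agree(bad))  # True False
```

Broadcasting a 150-bit vector cuts the per-vector cost by roughly a factor of 16, and the equality check detects a faulty processor without the tester needing to know the expected response.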

5 Results 

First  prototypes  of  the  DNA  chip  were  tested  in  Spring  of  1990.  Several  timing  problems 
were  found  in  chip  functions  that  had  not  been  completely  simulated  by  the  designers. 
Second prototypes produced perfect parts. JPL currently has a circuit board with 16 DNA
chips (a total of 256 processors) running and interfaced to a workstation.


High-Performance Multiprocessor Architecture
for a 3-D Lattice Gas Model¹

F.  Lee,  M.  Flynn  and  M.  Morf 
Computer  Systems  Laboratory 
Stanford  University,  Stanford,  CA  94305 

Abstract - The lattice gas method has recently emerged as a promising discrete
particle simulation method in areas such as fluid dynamics. We present a very
high-performance scalable multiprocessor architecture, called ALGE, proposed
for the simulation of a realistic 3-D lattice gas model, Henon's 24-bit FCHC
isometric model. Each of these VLSI processors is as powerful as a CRAY-2 for
this application. ALGE is scalable in the sense that it achieves linear speedup
for both fixed and increasing problem sizes with more processors.

The core computation of a lattice gas model consists of many repetitions
of two alternating phases: particle collision and propagation. Functional
decomposition by symmetry group and virtual move are the respective keys to efficient
implementation of collision and propagation.

1 Introduction 

High performance computing has become a vital enabling force in the conduct of science
and engineering research and development. In particular, simulations based on
computational fluid dynamics are less costly and much faster than complex wind tunnel tests. In
the past few years, the lattice gas method [3] has emerged as an attractive, robust and
promising discrete particle simulation method for fluid flow simulations with complicated
boundary conditions that are difficult or impossible to solve with other methods.
Various standard fluid dynamical equations, including the Navier-Stokes equations, can be
obtained from lattice gas models after proper limits are taken [4].

The core computation of a lattice gas model is inherently suitable for execution on
scalable parallel computing systems, without requiring floating point operations.
Increasing amounts of computing power are needed to solve large scale simulation problems. It
is believed that simple and practical application-specific computers (or co-processors) can
achieve performance orders of magnitude higher than existing "general-purpose"
supercomputers, which invariably focus on floating point operations. This belief has been confirmed in
the case of two-dimensional simulation, but not in the case of three-dimensional simulation,
which is much more important and challenging.

All existing special-purpose lattice gas computers, such as CAM-6 [12], RAP1, RAP2 [1],
and LGM-1 [6], deal with two-dimensional lattice gas models. To date, only one
other design, CAM-8 [10], proposed by Margolus and Toffoli, attempts to deal with
three-dimensional models, but this proposal is limited to 16 or fewer state bits per lattice node.
Yet we need to simulate models with 24 bits or more per node in order to achieve
realistic results in studying complex phenomena such as turbulent flow [2]. ALGE is to our
knowledge the first special-purpose machine proposed to tackle a realistic 3-D lattice gas
model.

¹ This work was supported by NASA Ames Research Center under contract NAGW 419.

2 Lattice  Gas  Models 

In  order  to  keep  this  paper  self  contained,  we  repeat  some  of  the  material  from  our  previous 
publication  [7],  on  which  this  paper  is  based. 

In a lattice gas model, space and time are discretized. Time is divided into a
sequence of equal time steps, at which particles reside only at the nodes of the lattice. The
evolution consists of two alternating phases: (i) propagation: during one time step, each
particle moves from one node to another along a link of the lattice according to its
velocity; (ii) collision: at the end of a time step, particles arriving at a given node collide and
instantaneously acquire new velocities. The properties of the lattice not only govern the
propagation phase, but also significantly constrain the collision phase, because the collision
rules must have the same symmetries as the lattice [4].

The state of a node can be denoted by the bit vector b = (b_1, ..., b_n), where b_i = 1 if a
particle with the corresponding velocity v_i is present,² and b_i = 0 otherwise. Let b(x, t)
and b'(x, t) be the states of the node at position x and time t before and after the collision
respectively. The collision phase specifies that, for all x and t,

$$\mathbf{b}'(\mathbf{x}, t) = C(\mathbf{b}(\mathbf{x}, t))$$

where C is a deterministic or non-deterministic n-input n-output boolean collision function.

The propagation phase specifies that, for all x and t,

$$b_i(\mathbf{x} + \mathbf{v}_i,\; t + 1) = b'_i(\mathbf{x}, t)$$

An obstacle such as a plate, a wedge or an airplane wing is decomposed into a series
of continuous links which approximate its geometrical shape. At nodes which represent
an obstacle, particles are either bounced back or undergo specular reflection. This can
be handled by adding one or more obstacle bits to the state of a node and adjusting the
collision function appropriately.

Before simulation, the states of the nodes are initialized according to the initial
distribution of particle densities and velocities. After simulation, nodes within a volume of tens
of nodes on each side are averaged to compute the macroscopic density and momentum.

There are two types of boundary conditions on the lattice edges we are concerned with.
The first type is the periodic boundary condition: the particles exiting from one edge are
reinjected into the other edge in the same direction. The second type, related to a
wind-tunnel experiment, consists in providing a flux of fresh particles on one side of the lattice
and allowing an output flux on the other side. In this paper, we focus on the first type of
boundary condition, because it is basic: it requires no special treatment for nodes on the
edges, as there are no edges in a wraparound lattice space. The second type can be dealt
with as a simple extension.

² In this paper, Roman and Greek indices refer respectively to labels and components.

Figure 1: The pseudo-four-dimensional FCHC model. Only the neighbors of one node are
shown as connected.

2.1  Three-Dimensional  Lattice 

The particular lattice we are most interested in is the FCHC lattice used in three-dimensional
simulations [4,11]. An FCHC (face-centered hypercubic) lattice consists of those
nodes, which are the points with signed integer coordinates $(x_1, x_2, x_3, x_4) = x$ such that
the sum $x_1 + x_2 + x_3 + x_4$ is even. Each node $x$ is linked to its 24 nearest neighbors $x'$ such
that the vector $x' - x$ corresponds to one of the following 24 values:

$(\pm 1, \pm 1, 0, 0)$,  $(\pm 1, 0, \pm 1, 0)$,  $(\pm 1, 0, 0, \pm 1)$,

$(0, \pm 1, \pm 1, 0)$,  $(0, \pm 1, 0, \pm 1)$,  $(0, 0, \pm 1, \pm 1)$.  (3)

These 24 nearest neighbors form a regular polytope. With time steps normalized to 1, the
vectors  in  (3)  are  also  the  24  possible  velocities  of  the  particles  arriving  at  or  leaving  from 
a node. 
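
The velocity set in (3) and the even-sum condition can be checked with a few lines of Python. This is only an illustrative sketch; the names `fchc_velocities`, `is_fchc_node` and `neighbors` are ours and do not come from the paper.

```python
from itertools import combinations, product

def fchc_velocities():
    """Return the 24 FCHC velocity vectors of equation (3)."""
    vels = []
    for i, j in combinations(range(4), 2):         # the two non-zero components
        for si, sj in product((1, -1), repeat=2):  # their signs
            v = [0, 0, 0, 0]
            v[i], v[j] = si, sj
            vels.append(tuple(v))
    return vels

VELOCITIES = fchc_velocities()

def is_fchc_node(x):
    """A point is a lattice node when its coordinate sum is even."""
    return sum(x) % 2 == 0

def neighbors(x):
    """The 24 nearest neighbors x' = x + v of node x."""
    return [tuple(a + b for a, b in zip(x, v)) for v in VELOCITIES]
```

Counting the vectors by their fourth component gives 12 with $v_4 = 0$ and 12 with $v_4 = \pm 1$, which is the split used by the pseudo-four-dimensional model.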
The pseudo four-dimensional FCHC model is derived by projecting the four-dimensional
FCHC lattice to three dimensions so that the fourth dimension has a periodicity of 1. Each
node of a regular cubic lattice is a node in the model. Figure 1 shows the neighborhood of a
node: along the gray links, connecting to 12 neighbors, at most one particle can propagate,
with component $v_4 = 0$; along the thick black links, connecting to 6 neighbors, up to two
particles can propagate, with components $v_4 = \pm 1$.

2.2  Isometric  Collision  Rules 

Associated  with  the  FCHC  lattice  is  the  isometry  group  G of  order  1152.  Roughly  speaking, 
an  isometry  is  a symmetry  operation  such  as  rotation  and  reflection  about  the  origin. 

The  isometric  collision  rules  [5]  require  that 

1.  Every  collision  is  an  isometry:  the  output  velocities  are  images  of  the  input  velocities 
in  an  isometry. 

2.  The  isometry  depends  on  the  momentum  only:  the  momentum  of  the  input  state  is 
computed,  and  then  normalized  by  taking  advantage  of  the  symmetries,  and  finally 
used  for  classification. 

3.  The  isometry  is  randomly  chosen  among  all  optimal  isometries:  this  is  why  non- 
determinism comes  into  play.  (An  optimal  isometry  is  one  which  minimizes  the 
viscosity  of  the  lattice  gas,  so  that  higher  Reynolds  numbers  can  be  reached.) 

3 System  Overview 

This  is  an  updated  version  of  the  design  of  ALGE  as  presented  in  [7].  The  machine 
is  organized  as  an  array  processor,  which  serves  as  a special  purpose  high  performance 
computing  engine  to  a host  computer.  The  host  computer  downloads  the  problem  (data) 
into  the  engine  and  offloads  the  engine-produced  solution.  The  host  provides  the  user 
interface  to  the  computing  engine  and  performs  the  pre-processing  and  post-processing 
phases  of  the  simulation. 

Figure 2 shows a 4x4 configuration of ALGE. The processors are connected as the
nodes of a 2-D toroid. Each identical processor^3 (P) has its own local memory (M). In a
simulation the 3-D problem space is decomposed into non-overlapping equal-sized partitions
such that nodes with the same Z-coordinates map to the same memory space, and adjacent
partitions map to adjacent memory spaces.

4 Processor  Architecture 

Figure  3 shows  the  functional  block  diagram  of  the  processor.  The  processor  contains 
the  following  units:  several  collision  units,  an  address  generator,  a transposer,  a switch,  a 


^3 It may contain several processing elements (PE), as referred to in [7].
Figure 2: A 4x4 configuration of ALGE

memory  address  register  (MAR),  a memory  data  register  (MDR)  and  a control. 

The  collision  units  can  be  viewed  as  the  “arithmetic”  units  of  a “superscalar”  processor. 
Each  unit  is  capable  of  computing  one  collision  function  per  cycle.  The  address  generator 
contains  a register  file  and  some  modified  adders.  It  is  responsible  for  generating  the 
proper  address  sequences  for  reading  and  writing  data  from  and  to  memory.  Each  local 
memory can supply one word of k bits per cycle. The n bits of any given node are each stored at
a different word address. Hence, it takes n cycles to read all n bits of each of the k nodes.
The  transposer  is  a two-way  shift  register  array.  Actually,  there  are  two  transposing  buffers 
so  that  one  can  be  emptied  (written  back  to  memory)  and  filled  (read  from  memory),  while 
the  other  is  accessed  by  the  collision  units.  The  switch  exchanges  data  with  neighboring 
processors  if  necessary.  At  any  time,  the  processor  either  reads  or  writes.  Since  the 
procedure  is  deterministic,  and  the  access  sequence  is  data  independent,  all  operations 
(AG,  RD,  etc.)  are  deeply  pipelined  in  order  to  achieve  maximum  throughput. 

The parameter k is the number of partitions mapped to a processor. The optimal
choice of k depends on n, the number of bits per node, the delay through the switch, and
the number of I/O pins and area of the VLSI implementation. Some typical numbers we
are considering are: n = 25 (1 obstacle bit), k = 192 for a processor with 4 collision units.

4.1  Collision  Unit 

The  properties  of  a lattice  not  only  govern  the  propagation  phase,  but  also  significantly 
constrain  the  collision  phase,  because  the  collision  rules  must  have  the  same  symmetry  as 
the  lattice  [4].  How  the  underlying  symmetry  group  of  a lattice  gas  model  can  be  exploited 
to  derive  compact  and  high  performance  processing  elements  to  handle  collision  functions 
of potentially exponential complexities ($O(n 2^n)$) was posed as a major challenge in this area
of research (see the Preface of [3]). The FCHC isometric model proposed by Henon [5] was
Figure 3: Functional block diagram of the processor

the  first  real  24-bit  three-dimensional  model  with  a detailed  specification  of  an  optimized 
non-deterministic  collision  function.  Therefore,  it  was  chosen  as  our  first  case  study.  A 
VLSI  architecture  for  the  FCHC  isometric  model  has  been  designed  and  implemented  as 
an  ASIC.  We  have  shown  that  a 4000  gate  chip  can  replace  the  equivalent  of  4.5  billion 
bits (rather than 384 million bits due to non-determinism) of a lookup table used to solve
this  problem.  Because  the  architecture  is  derived  by  considering  the  symmetry  properties 
rather  than  by  brute  force  logic  synthesis,  it  can  be  generalized  to  other  classes  of  lattice 
gas models. We present the main ideas in this paper. (Please see [8,9] for more details.)

Henon’s  isometric  algorithm  [5]  shows  how  the  output  state  of  a node  is  computed  as 
a non-deterministic  function  of  the  input  state: 

1.  Compute  the  momentum  of  the  input  state 

2.  Normalization:  Apply  the  appropriate  isometries  (symmetry  transformations)  to  the 
input  state  and  the  momentum,  so  that  the  momentum  is  normalized. 

3.  Collision:  Choose  at  random  one  of  the  optimal  isometries  of  the  class  to  which  the 
normalized momentum belongs, and apply this isometry.

4.  Denormalization:  Apply  the  isometries  applied  in  step  2 in  reverse  order  to  obtain 
the  output  state. 


The  application  of  an  isometry  to  a state  is  the  most  frequent  and  important  operation. 
An  efficient  implementation  of  this  operation  is  thus  most  crucial.  Cayley’s  theorem  states 
that  every  group  is  isomorphic  to  a permutation  group,  hence  it  is  not  too  surprising  that 
conditional  application  of  isometries  can  be  implemented  as  conditional  permutations, 
which  in  turn  map  to  simple  multiplexers.  In  essence,  the  algorithm  can  be  viewed  as  a 
description  of  how  to  generate  the  right  control  signals  to  permute  the  input  state  bits. 
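
The correspondence between an isometry and a permutation of the state bits can be sketched in Python. This is an illustration under our own assumptions, not the paper's circuit: the bit ordering of the 24 velocities, the names `bit_permutation` and `apply_isometry`, and the two sample isometries `swap12` and `negate1` (a coordinate exchange and a reflection) are all hypothetical choices.

```python
from itertools import combinations, product

# The 24 FCHC velocities of equation (3); bit i of a state means
# "a particle with velocity VELOCITIES[i] is present" (our own ordering).
VELOCITIES = [
    tuple((si if k == i else sj if k == j else 0) for k in range(4))
    for i, j in combinations(range(4), 2)
    for si, sj in product((1, -1), repeat=2)
]
INDEX = {v: i for i, v in enumerate(VELOCITIES)}

def momentum(state):
    """Sum of the velocities of all particles present in a 24-bit state."""
    total = [0, 0, 0, 0]
    for i, v in enumerate(VELOCITIES):
        if (state >> i) & 1:
            total = [t + c for t, c in zip(total, v)]
    return tuple(total)

def bit_permutation(isometry):
    """perm[i] is the bit index of the image of velocity i."""
    return [INDEX[isometry(v)] for v in VELOCITIES]

def apply_isometry(state, perm, enable=True):
    """Conditionally permute the state bits (one 2-to-1 multiplexer per bit)."""
    if not enable:
        return state
    out = 0
    for i, target in enumerate(perm):
        out |= ((state >> i) & 1) << target
    return out

# Two sample isometries, chosen only for illustration.
def swap12(v):
    """Exchange coordinates 1 and 2."""
    return (v[1], v[0], v[2], v[3])

def negate1(v):
    """Reflect coordinate 1."""
    return (-v[0], v[1], v[2], v[3])
```

Because an isometry is linear, the momentum of the permuted state equals the isometry applied to the momentum of the original state, which is what lets the control path work on the momentum alone.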
Figure 4: A collision unit (control generator and permutation network)

The organization of a collision unit (Figure 4) follows classical lines: the control path
consists of a momentum adder, a momentum normalizer, a small collision rule table,
and a randomizer; the data path is a conditional permutation network composed of a state
normalizer, a state collider and a denormalizer (inverse-normalizer). The overall
feed-forward character of this unit makes it easy to design a highly pipelined version with a
proportional increase in throughput.

A CMOS  field  programmable  gate  array  implementation  of  the  unit  with  a non- 
pipelined  latency  of  460  ns  has  been  completed.  A CMOS  gate  array  implementation 
is  estimated  to  have  a non-pipelined  latency  below  50  ns.  A collision  unit  capable  of  20 
million  node  updates  per  second  (MNUPS)  or  more  is  clearly  feasible.  This  is  comparable 
to CRAY-2's performance of approximately 30 MNUPS [11].

4.2  Address  Generator 

As  large  simulation  problems  require  the  use  of  a huge  amount  of  memory,  memory  chips 
can  easily  become  the  dominant  cost  factor  of  the  system.  Our  solution  avoids  the  common 
but  expensive  alternative  of  double  buffering  the  complete  memory  space,  while  retaining 
a high  degree  of  flexibility  in  the  choice  of  lengths  of  each  dimension  of  the  (simulation) 
problem  space.  This  is  made  possible  by  the  virtual  move  addressing  mechanism,  which 
exploits  the  true  data  dependency  of  the  computation  steps  involved.  Data  movement 
implied  by  propagation  but  not  by  communication  requirements  can  thus  be  eliminated. 

The  address  generator  contains  a number  of  registers  and  an  arithmetic  datapath 
(adder)  to  generate  the  complex  sequence  required. 
4.2.1 Virtual Move

Although  the  FCHC  models  are  our  major  concerns,  the  mechanisms  described  below 
apply  to  a larger  class  of  models  with  other  possible  velocities.  The  following  section  is 
written with general notations so as to be valid for any D-dimensional space and arbitrary
velocity.

The  propagation  equation  (2)  seems  to  suggest  that  at  every  time  step,  state  bits  of 
all  the  nodes  have  to  be  moved.  However,  a closer  examination  reveals  that  the  equation 
actually  represents  an  invariant  relationship.  If  we  choose  to  observe  in  a frame  of  reference 
moving at the velocity $v^i$ with respect to the rest frame of the lattice, the particles with
velocity $v^i$ are obviously stationary! Hence, there is no need to actually move the bits in
memory, as long as we keep track of the Galilean transformation. The coordinate $x^i$ of the
moving frame is related to the rest coordinate $x$ by the transformation:

$x^i = x - v^i t$  (4)

Suppose the space-time point $(x, t)$ corresponds to $(x^i, t)$; then the point $(x + v^i, t + 1)$
corresponds to $((x + v^i) - v^i (t + 1), t + 1) = (x - v^i t, t + 1) = (x^i, t + 1)$. Hence, (1) and
(2) can be written as

$b'(x^i, t) = C(b(x^i, t))$  (5)

$b_i(x^i, t + 1) = b'_i(x^i, t)$  (6)

If we interpret $x^i$ as the physical address used to address memory module $i$, then $x$
can be treated as the virtual address. Equation (6) says that we do not have to move
the bits at all in the propagation phase. We refer to this technique as virtual move.
The cost of implementing virtual move is a slightly more complicated address
generation scheme. For each virtual address $x$, we need to generate $n$ physical addresses,
$x^i$ $(i = 1, \ldots, n)$ according to (4). However, only one address has to be generated per cycle,
if  we  access  one  bit  plane  at  a time.  Multiple  bit  planes  can  be  stored  in  the  same  memory 
space  by  interleaving. 
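
The equivalence between physically moving the bits and translating the address can be illustrated with a small sketch. It assumes a one-dimensional wrap-around lattice of length `n` and an identity collision, purely to keep the example short; the function names are ours.

```python
def propagate_by_moving(planes, velocities, n):
    """Reference model: physically move every bit, as in equation (2)."""
    moved = []
    for plane, v in zip(planes, velocities):
        new = [0] * n
        for x in range(n):
            new[(x + v) % n] = plane[x]
        moved.append(new)
    return moved

def read_virtual(memory, velocities, n, x, t):
    """Read all bits of virtual node x at time t without moving any data.

    Bit plane i is fetched from physical address (x - v^i * t) mod n,
    which is equation (4) on a wrap-around lattice.
    """
    return [plane[(x - v * t) % n] for plane, v in zip(memory, velocities)]
```

After `t` calls to `propagate_by_moving`, the bits found at node `x` are the same bits that `read_virtual` fetches from the untouched memory, so only the address computation changes from one time step to the next.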

4.2.2  Multi-dimensional  Modulo  Adder 

The transformation (4) requires $D$ modulo subtractions for each $v^i$. How can one proceed
to  implement  the  address  generators  in  hardware? 

Suppose  we  have  a wrap-around  lattice  space  of  dimension  D,  implied  by  the  basic 
type  of  periodic  boundary  condition  (see  section  2).  Equation  (4)  can  be  written  as 

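$x^i_\alpha = (x_\alpha - v^i_\alpha t) \bmod n_\alpha$,  $\alpha = 1, \ldots, D$

$\quad = (x_\alpha + (-v^i_\alpha t \bmod n_\alpha)) \bmod n_\alpha$

$\quad = (x_\alpha + d^i_\alpha(t)) \bmod n_\alpha$  (7)

where $n_\alpha$ is the length of the $x_\alpha$-dimension, and

$d^i_\alpha(t) = -v^i_\alpha t \bmod n_\alpha$  (8)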




Note that $d^i_\alpha$ has to be recomputed only once per time step by addition:

$d^i_\alpha(t + 1) = (d^i_\alpha(t) + (-v^i_\alpha \bmod n_\alpha)) \bmod n_\alpha$  (9)
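
As a quick check of (7) to (9), the following sketch compares the incrementally updated displacement with the directly computed one for a single dimension. The function names are illustrative only.

```python
def displacement(v, t, n):
    """d(t) = (-v * t) mod n, equation (8), computed directly."""
    return (-v * t) % n

def next_displacement(d, v, n):
    """d(t + 1) obtained from d(t) by one modular addition, equation (9)."""
    return (d + (-v % n)) % n

def physical_coordinate(x, d, n):
    """x^i = (x + d) mod n, the last line of equation (7)."""
    return (x + d) % n
```

Starting from `d = 0` at `t = 0` and calling `next_displacement` once per time step reproduces `displacement(v, t, n)` for every `t`, so no multiplication is needed in hardware.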

In order to use conventional RAM, we need to map $x$ ($x^i$) to a linear address. We
choose the conventional one-to-one mapping

$A : \{0, 1, \ldots, n_1 - 1\} \times \{0, 1, \ldots, n_2 - 1\} \times \cdots \times \{0, 1, \ldots, n_D - 1\} \to \{0, 1, \ldots, n_1 n_2 \cdots n_D - 1\}$

such that

$A(x) = A((x_1, x_2, \ldots, x_D)) = x_1 + x_2 n_1 + \cdots + x_D \prod_{\alpha=1}^{D-1} n_\alpha$.  (10)

Assume that all $n_\alpha$'s are powers of 2, such that $m_\alpha = \log_2 n_\alpha$ and $\sum_\alpha m_\alpha = m$. The
mapping $A$ can then be performed trivially by concatenating the binary representations of
the components of $x$ such that $x_{\alpha+1}$ is on the left of $x_\alpha$. Similarly, we can obtain the linear
address of $d^i$. Let $a = A(x)$ and $b = A(d^i)$, and define $e$ as

$e_j = \begin{cases} 0 & \text{if } j = \sum_{\beta=1}^{\alpha} m_\beta \text{ for some } \alpha \in [0, D - 1] \\ 1 & \text{otherwise} \end{cases}$  (11)

The  purpose  of  e is  to  mark  the  boundary  bits  of  dimensions  so  that  carry-out  from  lower 
dimensions  would  not  be  propagated  to  higher  dimensions.  The  value  of  e does  not  change 
during  a simulation. 
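
A minimal sketch of the mapping $A$ and of the mask $e$ follows, assuming power-of-two lengths given as a tuple `m` of field widths. The placement of the zero bits of $e$ (at the lowest bit of each dimension) follows our reading of (11) and of the $e$ row in Figure 8; the function names are ours.

```python
def linear_address(x, m):
    """A(x): concatenate the fields, with x[0] in the least significant bits."""
    addr, shift = 0, 0
    for xa, ma in zip(x, m):
        addr |= xa << shift
        shift += ma
    return addr

def boundary_mask(m):
    """e: 0 at the lowest bit of every dimension, 1 elsewhere."""
    e = (1 << sum(m)) - 1
    shift = 0
    for ma in m:
        e &= ~(1 << shift)
        shift += ma
    return e
```

For the 4 x 2 lattice of the example in Section 4.2.3, `m = (2, 1)`, so `linear_address((3, 0), m)` is 3 and `boundary_mask(m)` is the bit pattern 010.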

We can calculate all $D$ components of $x^i$ according to (7) in one step by using a
multi-dimensional modulo adder, which takes three $m$-bit inputs, $a$, $b$, and $e$, and computes the
sum as $s = A(x^i)$. The adder can be built according to the new definitions of $p_i$, propagate,
and $s_i$, sum (here the subscript $i$ denotes the bit position):

$p_i = (a_i \lor b_i) e_i$  (12)

$s_i = a_i \oplus b_i \oplus c_i e_i$  (13)

and our usual definitions of $g_i$, generate, and $c_i$, carry:

$g_i = a_i b_i$  (14)

$c_0 = 0$  (15)

$c_{i+1} = g_i \lor p_i c_i$  (16)

Hence,  this  modified  adder  can  be  implemented  in  various  ways,  such  as  ripple  carry 
adder,  carry  lookahead  adder,  or  carry  select  adder,  as  deemed  appropriate  for  the  system 
requirement  and  implementation  technology. 
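
Equations (12) to (16) can be modeled bit by bit as a ripple-carry adder. The sketch below is a software model only, not the hardware design; `modulo_add` is a hypothetical name, and the mask `e` is passed in as an integer with the convention read from Figure 8 (0 at the lowest bit of each dimension).

```python
def modulo_add(a, b, e, m):
    """Ripple-carry model of equations (12)-(16).

    a, b and e are m-bit integers. Returns (s, carries), where carries[i]
    is the carry c_i for i = 0..m.
    """
    s = 0
    c = 0                                 # c_0 = 0, equation (15)
    carries = [c]
    for i in range(m):
        ai, bi, ei = (a >> i) & 1, (b >> i) & 1, (e >> i) & 1
        p = (ai | bi) & ei                # propagate, equation (12)
        g = ai & bi                       # generate, equation (14)
        s |= (ai ^ bi ^ (c & ei)) << i    # sum, equation (13)
        c = g | (p & c)                   # next carry, equation (16)
        carries.append(c)
    return s, carries
```

With `a = 0b011`, `b = 0b001`, `e = 0b010` and `m = 3`, the model returns the sum 000 and the carries $c_3 c_2 c_1 c_0$ = 0110, matching the values shown in Figure 8.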


4.2.3  Example 

Let us illustrate the idea by a small example. Suppose we have the 2-D square lattice
with $n = 4$ bits per node, and $v^1 = (1, 0)$, $v^2 = (-1, 0)$, $v^3 = (0, 1)$, and $v^4 = (0, -1)$.
bit position    3 2 1 0
e                 0 1 0
a                 0 1 1    (x = (3,0))
b                 0 0 1    (d^2 = (1,0))
c               0 1 1 0
s                 0 0 0    (x^2 = (0,0))

Figure 8: Operation of a multi-dimensional modulo adder.


Figure  9:  Operation  of  the  transposer 


The lattice (problem) space has 8 nodes with $n_1 = 4$ and $n_2 = 2$. In Figure 5, each box
represents a physical memory location, the linear address mapping of the coordinate of
which is given in Figure 6, as calculated by (10). At $t = 0$, the virtual address $x$ and
physical addresses $x^i$ $(i = 1, 2, 3, 4)$ for any given node are the same. At $t = 1$, they are
different, as governed by (4) and shown in Figure 7. The numeric label within each box
represents a binary value. Suppose we are interested in the bit plane of $b_2$. The label “3”
of the $b_2$ boxes represents the binary value of $b_2$ at $A(x) = 3$ just after collision at $t = 0$. At
$t = 1$ just before collision, the label “3” of the $b_2$ boxes is at a different location, because
the bit has been moved to the left by one position, where $A(x) = 2$. However, the move
can be avoided if the virtual address 2 is somehow translated into the physical address 3, so
that access to $A(x) = 2$ becomes access to $A(x^2) = 3$. Figure 8 shows how the translation
can be computed by a multi-dimensional modulo adder.


4.3 Transposer

The transposer contains two transpose buffers, which are filled and emptied alternately.
Figure 9 shows its operation. The transposer affects both the memory addressing and the
data structure used to store the array.

Figure 10: Propagation of bits of $b_2$ across partitions

During the $i$-th cycle, the $i$-th bits of the $k$ words are read and shifted into a transpose
buffer. At the end of the $n$ cycles, the $n$ bits of one node are shifted out per cycle. A
collision unit takes the bits as input, and the output is written back to the buffer, which acts
as a circular shift register. With multiple ($u$) collision units, multiple updates (collisions)
can be executed in parallel per cycle. This update continues for $k/u$ cycles until all $k$
nodes are processed. They are then shifted out bit-by-bit to memory. While one buffer is
busy acting as an $n$-wide circular shift register to serve the collision units, the other can
be emptied and refilled just in time to take its turn, if $k$ is chosen appropriately.
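
The data movement through one transpose buffer can be modeled in software as a bit-matrix transpose. The sketch below assumes that memory holds `n` words of `k` bits each (one word per bit plane) and models a single buffer only; double buffering and pipelining are not modeled, and all names are ours.

```python
def fill_buffer(words, k):
    """Turn n memory words (bit planes of k bits) into k node states.

    Word i supplies bit i of every one of the k node states.
    """
    nodes = [0] * k
    for i, word in enumerate(words):
        for j in range(k):
            nodes[j] |= ((word >> j) & 1) << i
    return nodes

def empty_buffer(nodes, n):
    """Inverse of fill_buffer: write node states back as n bit-plane words."""
    words = [0] * n
    for j, state in enumerate(nodes):
        for i in range(n):
            words[i] |= ((state >> i) & 1) << j
    return words

def update_block(words, k, collide):
    """Read a block, apply a collision function to each node, write it back."""
    nodes = [collide(state) for state in fill_buffer(words, k)]
    return empty_buffer(nodes, len(words))
```

Here `collide` stands for any function from an n-bit state to an n-bit state; in the machine this role is played by the collision units.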


4.4  Switch 


Updating a node at the border of a partition requires reading values from one or more
adjacent partitions. We need to know when to select data bits from which partitions.
According to (2), we know where the neighboring nodes are in the problem space:

$b_i(x, t + 1) = b'_i(x - v^i, t)$  (17)

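Let us define three coordinate systems, namely, the global, partition, and local coordinates
such that they satisfy the following relationship:

$x^G = P x^P + x^L$  (18)

where $P$ is a diagonal matrix with $P_{\alpha\alpha} = n_\alpha$, and the following conditions are satisfied:

$0 \le x^L_\alpha < n_\alpha, \quad 0 \le x^P_\alpha < p_\alpha, \quad 0 \le x^G_\alpha < n_\alpha p_\alpha$  (19)

Alternatively, we can write for any $\alpha$

$x^G_\alpha \bmod n_\alpha p_\alpha = n_\alpha (x^P_\alpha \bmod p_\alpha) + x^L_\alpha \bmod n_\alpha$  (20)

Then we can show that for any $\alpha$,

$(x^G_\alpha - v^i_\alpha) \bmod n_\alpha p_\alpha = n_\alpha ((x^P_\alpha + \mathrm{bound}(0, x^L_\alpha - v^i_\alpha, n_\alpha)) \bmod p_\alpha) + (x^L_\alpha - v^i_\alpha) \bmod n_\alpha$  (21)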
where bound is defined as

$\mathrm{bound}(L, k, U) = \begin{cases} -1 & \text{if } k < L \\ 0 & \text{if } L \le k < U \\ 1 & \text{if } k \ge U \end{cases}$  (22)


For any given $v^i$, $\mathrm{bound}(0, x^L_\alpha - v^i_\alpha, n_\alpha)$ is either non-negative or non-positive. Hence,
it is only necessary to distinguish whether the value is zero or non-zero.

In equation (21), the partition coordinate determines the source partition, and the
local coordinate determines the bit within a partition. Since the machine is synchronous,
at any one time, all $x^L$ have the same value for all partitions. This exactly matches the
requirement implied by (21).
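
Equation (21) can be checked numerically for one dimension. The sketch below assumes a partition length `n`, `p` partitions along the dimension, and a velocity component `v` of -1, 0 or +1; the helper `source` is a hypothetical name for the pair (source partition, local coordinate).

```python
def bound(L, k, U):
    """Equation (22): -1 below the range [L, U), 0 inside it, +1 above it."""
    if k < L:
        return -1
    if k >= U:
        return 1
    return 0

def source(xP, xL, v, n, p):
    """Partition and local coordinate of the neighbor read in equation (17)."""
    src_partition = (xP + bound(0, xL - v, n)) % p
    src_local = (xL - v) % n
    return src_partition, src_local
```

For every partition coordinate, local coordinate and velocity component, `n * src_partition + src_local` equals `(n * xP + xL - v) % (n * p)`, which is the statement of (21).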

Figure 10 shows an example. It shows the data distribution at $t = 1$ for $b_2$ for the same
example shown in Figure 5. The first digit of a label represents the partition coordinate
mapping, while the second one represents the local coordinate mapping.

The function bound can be computed as a carry-out of the multi-dimensional modulo
adder. Let $a = A(x^L)$, $d^i_\alpha = -v^i_\alpha$, and $e$ as defined before be the inputs to a
multi-dimensional modulo adder; then for each $\alpha$, $|\mathrm{bound}(0, x^L_\alpha - v^i_\alpha, n_\alpha)|$ is exactly the carry-out
bit which is to be blocked so that it will not flow across the dimension boundary. In
Figure 8 the carry bit $c_2 = 1$ indicates that the local virtual address 3 of a partition maps
to the local physical address 0 of its adjacent partition, as shown in Figure 10.


5 Summary 

We  have  outlined  a number  of  unique  architectural  features  of  a very  high  performance 
pipelined  array  processor  dedicated  to  lattice  gas  simulation.  The  architecture  is  truly 
scalable  in  the  sense  that  it  achieves  linear  speedup  for  both  fixed  and  increasing  problem 
sizes  with  more  processors.  It  is  necessary  and  possible  to  take  advantage  of  the  special 
properties  of  the  application  to  design  application-specific  computers  that  are  a thousand 
times  more  powerful  than  existing  supercomputers. 

The  driving  limitation  of  ALGE  is  memory  bandwidth.  This  situation  becomes  more 
severe  as  the  processing  elements  run  faster  and  the  clock  cycle  gets  shorter.  This  may  be 
an  ideal  project  for  the  use  of  high  density  mounting  and  packaging  technology  such  as 
multiple  chip  modules. 

Current  work  is  focusing  on  resolving  finer  issues  of  design  and  implementation  with 
the  goal  of  building  a prototype  system. 

The  promise  of  powerful  VLSI  processors  for  digital  wind  tunnels  opens  up  the  potential 
for  desk-top  and  onboard  applications. 
Abstract: In this paper we explore the specification and verification of VLSI
designs. The paper focuses on abstract specification and verification of
functionality using mathematical logic as opposed to low-level boolean equivalence
verification such as that done using BDDs and Model Checking. Specification
and verification, sometimes called formal methods, is one tool for increasing
computer dependability in the face of an exponentially increasing testing effort.


1 Introduction 

Reliable  computer  systems  are  becoming  increasingly  difficult  to  engineer.  The  successes  of 
IC  fabrication  technology  have  put  VLSI  engineers  in  the  position  of  building  dependable 
computers  that  are  orders  of  magnitude  more  complex  than  the  largest  computers  of  even 
a decade  ago.  With  even  larger  numbers  of  transistors  promised  in  the  near  term,  research 
is  being  done  to  make  the  reliable  engineering  of  complex  VLSI  designs  practical. 

There are two complementary approaches to computer reliability: fault tolerance and
fault  exclusion.  The  former  is  most  useful  in  handling  dynamic  faults  occurring  during 
system  operation  due  to  component  failure  or  other  unexpected  events.  The  latter  is  a 
static  process  intended  to  remove  errors  in  design  and  implementation  before  the  computer 
system  is  in  service. 

Testing  and  simulation  are  well-known  fault  exclusion  techniques.  Testing  and  simu- 
lation are  used  extensively  in  the  design,  implementation,  and  manufacturing  of  computer 
systems.  The  problem  is  that  testing  and  simulation  can  never  exhaustively  cover  every 
possible  situation  that  the  circuit  might  encounter.  Pygott  [13]  states 

“A comparatively simple 8-bit microprocessor such as the Z80 has 208 internal
memory elements and 13 input signals, meaning that the circuit is capable of
$2^{221}$ different state transitions. Even if a transition could be simulated every
microsecond, it would take $10^{53}$ years to examine all the possible changes (this
is far larger than the age of the universe).”

Clearly,  only  a tiny  fraction  of  the  possible  state  transitions  can  be  tested.  This  situation 
has  led  to  VLSI  devices  going  to  market  with  design  faults  which  were  not  caught  in  testing. 
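
The order of magnitude in the quotation is easy to reproduce. The short calculation below uses only the figures quoted above (208 memory elements, 13 inputs, one transition per microsecond); the year length of 365.25 days is our own choice.

```python
memory_elements = 208
input_signals = 13

# Every combination of state and input bits is one possible transition.
transitions = 2 ** (memory_elements + input_signals)   # 2^221

seconds = transitions * 1e-6             # one transition per microsecond
seconds_per_year = 365.25 * 24 * 3600
years = seconds / seconds_per_year       # on the order of 10^53
```

The result is slightly above $10^{53}$ years, in agreement with the figure Pygott gives.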

One  possible  answer  to  the  inadequacies  of  testing  and  simulation  is  hardware  syn- 
thesis from  high-level  circuit  descriptions  written  in  an  appropriate  hardware  description 
language  (HDL)  such  as  VHDL  [7].  Synthesis  from  an  HDL  description  certainly  has  much 
promise.  Textual  descriptions  are  easy  to  store,  manipulate,  and  process.  Also,  synthesis 
tools are likely to be reliable since the social process of hundreds of users using a synthesis
program  tends  to  exorcise  any  latent  bugs. 

Unfortunately,  synthesis  of  VHDL  circuit  descriptions  is  not  sufficient  for  dependable 
computing.  As  a case  in  point,  consider  that  high-level  programming  languages  have  been 
in  use  for  20  years  and  programs  still  contain  numerous  errors.  There  are  a number  of 
reasons  why  this  is  so: 

1. HDLs are generally verbose, making them hard to read.

2.  HDL  constructs  are  not  usually  amenable  to  formal  analysis.  Thus  it  is  nearly 
impossible  to  show  that  a particular  description  has  desired  properties. 

3.  Constructs  that  can  be  synthesized  are  frequently  not  abstract  enough  to  be  of  use 
as  system  specifications. 

4. Contrary to what the marketers of synthesis systems would have one believe, circuit
descriptions outside of a small subset of an HDL cannot be synthesized. This is
particularly true of abstract descriptions. One need not search further than a multiplier
to find an example of this.

Because  of  these  limitations  in  testing,  simulation,  and  synthesis,  much  effort  is  being 
expended  in  the  formal  specification  and  verification  of  hardware.  Formal  methods  offer 
hope  of  overcoming  some  of  these  shortcomings  because  they  are  based  on  logic  and  can 
thus  take  advantage  of  the  decades  of  mathematical  research  on  using  logical  analysis. 

1.  Logical  circuit  descriptions  are  often  more  concise  than  conventional  HDL  descrip- 
tions. 

2.  Numerous  formalisms  can  be  embedded  in  logic.  This  allows  the  circuit  specifier  to 
use  the  most  appropriate  formalism  for  the  job  [10]. 

3.  One  can  prove  properties  about  logic  descriptions  directly  using  a proof  system  such 
as  predicate  calculus.  This  can  be  very  effective  for  establishing  that  a specification 
meets  its  requirements  [17]. 

4. Analysis can be applied to the specification and less-abstract structural circuit
descriptions to show functional correctness [2,9,16].

5.  Logic  provides  behavioral,  structural,  data,  and  temporal  abstraction  mechanisms 
for  reducing  the  complexity  of  the  description  [12]. 

For  these  and  other  reasons,  we  believe  that  formal  methods  can  play  an  important  part 
in increasing the reliability of computer systems.

Note  that  we  are  not  suggesting  that  formal  methods  replace  testing,  simulation,  and 
synthesis,  but  rather  that  they  complement  these  techniques.  Figure  1 shows  an  idealized 
ASIC design process (adapted from [11]). The RTL circuit description is written in an
appropriate HDL and subjected to synthesis and simulation. The specification is a more
abstract declarative description which is subject to formal analysis.

Figure 1: The ASIC Design Process.

Having  described  the  benefits  of  formal  methods,  there  are  a number  of  directions 
that we could take. We could, for example, focus on how formal methods interact with
conventional  CAD  tools  or  discuss  the  pros  and  cons  of  the  design  process  in  Figure  1. 
Instead,  we  will  show  how  logic  can  be  used  to  specify  circuits  and  how  proof  functions  as 
a mathematical  analysis  tool  for  reasoning  about  those  specifications.  We  do  not  attempt 
to  give  a complete  survey  of  the  field,  but  rather  focus  on  a demonstration  of  techniques. 

2 Using Logic to Specify Hardware

A circuit  is  a collection  of  devices  composed  by  interconnection.  Each  of  these  devices 
has  ports  which  are  used  for  input,  output,  or  both.  The  behavior  of  a device  can  be 
expressed  in  terms  of  its  ports.  Each  of  the  devices  in  a circuit  can,  in  turn,  be  viewed  as  a 
composition  of  still  other  devices.  This  hierarchy  of  devices  eventually  leads  to  the  devices 
that  the  designer  considers  primitive.  The  smallest  devices  we  will  deal  with  in  this  paper 
are  logic  gates  and  indeed,  in  many  cases,  we  will  stop  much  higher  than  even  gates. 

Clocksin  describes  several  ways  to  specify  circuit  structure  [3]: 

• We  can  use  imperative  declarations  of  the  circuit  structure  (this  is  referred  to  as  the 
extensional  method). 

• We  can  use  functions  to  describe  the  output  in  terms  of  the  input. 

• We  can  use  predicates  in  a quantified  logic  to  relate  the  ports  of  a device  using 
behavioral  or  structural  constraints. 

Each  of  these  methods  has  advantages  and  disadvantages.  The  extensional  method  has 
the  advantage  of  being  familiar  to  designers  since  it  resembles  imperative  languages  such 
as Pascal that most designers have used. Most modern hardware description languages
(e.g. VHDL) use the extensional method. The largest disadvantage of the extensional
method is that it is difficult to treat formally, just as imperative programming languages
are hard to treat formally.

The functional model is widely used; Hunt’s specification of the FM8501 microprocessor,
for example, is functional [6]. To specify the behavior of sequential circuits functionally,
the specification language must support recursion. Hunt uses recursion to describe
the sequential operation of his CPU.

In  the  functional  model,  circuit  interconnection  is  given  by  the  syntactic  structure  of 
function  application.  This  can  cause  several  problems: 

• Describing  circuits  with  bi-directional  ports  is  difficult  since  functional  specifications 
differentiate  between  input  and  output  syntactically. 

• The  purpose  of  a structural  specification  is  to  show  how  components  are  connected 
together.  Since  the  only  means  of  expressing  connection  is  function  application,  even 
returning  a tuple  is  insufficient  for  describing  circuits  with  more  than  one  output. 
Figure 2: Implementation of a simple circuit, D

• Sequential circuits feed back on themselves. Recursion is the best alternative; but
that  can  be  inadequate  for  circuits  with  multiple  feedback  paths. 

The  predicate  method  is  a widely  used  specification  technique  [5]  and  is  the  one  we  will 
demonstrate  in  this  paper.  A disadvantage  of  the  predicate  method  is  that  designers  are 
likely  to  find  it  the  most  unfamiliar  of  the  three  and  thus  difficult  to  use.  In  addition,  to  use 
the  predicate  method,  the  logic  must  support  existential  quantification,  either  explicitly 
or  implicitly.  (Prolog  is  an  example  of  a language  with  implicit  existential  quantification.) 
The predicate method does, however, lend itself to a wide variety of circuit types, including
those with multiple outputs and bi-directional ports.

2.1  Specifying  Circuits  with  Predicates. 

As  an  example  of  the  predicate  model,  we  will  specify  the  behavior  and  structure  of  a very 
simple  circuit  we  call  D.  The  predicate  that  specifies  the  behavior  of  the  circuit  can  be 
given  by  the  following  logic  definition: 

⊢def D(a, b, c, d, out) = (out = (a ∧ b) ∨ (c ∧ d))

Notice  that  the  inputs  and  outputs  are  all  included  in  the  arguments  and  the  behavior  is 
expressed  as  a constraint  among  the  outputs  and  the  inputs. 

One  possible  implementation  for  D is  shown  in  Figure  2.  As  was  mentioned  earlier, 
each  device  can  be  thought  of  as  representing  a constraint  on  its  inputs  and  outputs.  For 
example,  the  top  And  gate  constrains  a,  b,  and  p in  a manner  consistent  with  the  behavior 
of  the  device. 

⊢def And(a, b, p) = (p = a ∧ b)

To  get  the  constraint  represented  by  the  entire  device,  we  can  compose  the  individual 
constraints  using  conjunction. 

And(a, b, p) ∧ And(c, d, q) ∧ Or(p, q, out)

This  expression  constrains  the  values  not  only  on  the  ports  of  the  device,  a,  b,  c,  d,  and 
out,  but  also  on  the  internal  lines  p and  q.  We  normally  wish  to  regard  such  a device  as 
a "black box" and consequently are interested only in the values of the external lines. We
can  hide  the  internal  lines  using  existentially  quantified  variables  and  define  a predicate 
D_imp  that  represents  the  structure  of  the  circuit. 

⊢def D_imp(a, b, c, d, out) =
    ∃ p q. And(a, b, p) ∧ And(c, d, q) ∧ Or(p, q, out)

While  this  formula  looks  confusing  at  first,  we  should  note  that  this  level  of  specification 
can  be  produced  automatically  from  netlists  or  traditional  HDL  models. 
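The predicate style can be imitated in an ordinary programming language. The following is a minimal sketch in Python rather than HOL, and the names `and_gate`, `or_gate`, `internal_witnesses`, and `d_imp` are ours, not part of the original specification. Each device is a Boolean-valued function of all of its lines, composition is conjunction, and the existential quantifier over p and q becomes a search for witness values of the hidden lines.

```python
from itertools import product

BOOLS = (False, True)

def and_gate(a, b, p):
    """Predicate for an AND gate: true when the lines satisfy p = a AND b."""
    return p == (a and b)

def or_gate(a, b, p):
    """Predicate for an OR gate: true when the lines satisfy p = a OR b."""
    return p == (a or b)

def internal_witnesses(a, b, c, d, out):
    """All values of the hidden lines (p, q) that satisfy every gate constraint."""
    return [
        (p, q)
        for p, q in product(BOOLS, repeat=2)
        if and_gate(a, b, p) and and_gate(c, d, q) and or_gate(p, q, out)
    ]

def d_imp(a, b, c, d, out):
    """Structural predicate: holds when some assignment to p and q exists."""
    return len(internal_witnesses(a, b, c, d, out)) > 0
```

For a given assignment to the external lines there is at most one witness, because each gate determines its output; the predicate simply reports whether that witness exists.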

For  comparison,  the  following  specification  describes  the  same  circuit  using  functions: 

⊢def D(a, b, c, d) = Or(And(a, b), And(c, d))

The  outputs  are  not  mentioned  explicitly;  the  result  of  the  function  is  taken  to  be  the 
output  of  the  circuit. 

Similarly, we can write an extensional specification of the circuit in a hardware description
language such as VHDL [1]:

entity D_imp is
    port (a, b, c, d : in Bit; outp : out Bit);
end D_imp;

architecture Structure of D_imp is
    component ANDGate port (i1, i2 : in Bit; outp : out Bit);
    end component;
    component ORGate port (i1, i2 : in Bit; outp : out Bit);
    end component;
    signal p, q : Bit;
begin
    G1: ANDGate port map (a, b, p);
    G2: ANDGate port map (c, d, q);
    G3: ORGate port map (p, q, outp);
end Structure;

The  difference  between  this  specification  and  the  predicate  model  of  the  circuit  structure  is 
largely  superficial.  The  primary  difference  is  the  abundance  of  keywords  in  the  extensional 
specification.  The  biggest  impediment  to  using  specification  languages  such  as  VHDL  is 
that  they  sometimes  lack  a clear  semantics.  This  problem  can  be  overcome  by  defining  a 
semantics  of  the  specification  language  in  the  object  language  of  a verification  tool  such 
as  HOL.  Van  Tassel  has  done  just  that  using  VHDL  and  HOL  in  [14,15]. 

2.2  Specifying  Sequential  Behavior. 

The last section specified a simple combinational circuit. We specify the behavior of
sequential circuits in higher-order logic using an explicit representation of time.

For  example,  we  can  specify  the  behavior  of  a simple  latch  as  follows: 

⊢def latch in out set = ∀ t. out (t + 1) = (set t → in t | out t)

In  the  specification,  in,  out,  and  set  are  functions  of  time.  The  value  of  a signal  at  time 
t is  returned  when  the  function  representing  the  signal  is  applied  to  t.  The  specification 
says  that  the  value  of  out  at  time  t + 1 gets  the  value  of  the  input  port,  in,  at  time  t if 
the set line is high and remains unchanged otherwise. Universal quantification over time
is  used  in  defining  the  predicate. 
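To make the reading of the latch predicate concrete, here is a small Python sketch that treats signals as functions from time to Boolean values. It is only an approximation: the universal quantifier over all time is replaced by a check up to a finite `horizon`, a parameter we introduce for illustration, and the example traces are invented.

```python
def latch(inp, out, set_, horizon):
    """Bounded check of: for all t, out(t+1) = (in(t) if set(t) else out(t))."""
    return all(
        out(t + 1) == (inp(t) if set_(t) else out(t))
        for t in range(horizon)
    )

# Example traces (invented for illustration), indexed by time.
inp_trace = [True, False, True, False, False]
set_trace = [True, False, False, True, False]
out_trace = [False, True, True, True, False, False]

ok = latch(
    lambda t: inp_trace[t],
    lambda t: out_trace[t],
    lambda t: set_trace[t],
    horizon=5,
)

# An output stuck low violates the constraint as soon as set and in are both high.
stuck = latch(
    lambda t: inp_trace[t],
    lambda t: False,
    lambda t: set_trace[t],
    horizon=5,
)
```

A bounded check of this kind can refute a candidate trace but cannot establish the quantified statement, which is why the paper relies on proof.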

We  can  also  use  existential  quantification  to  describe  temporal  operators.  For  example, 
suppose  that  we  wish  to  define  a predicate  that  says  that  a signal  will  eventually  go  high. 
The  following  is  a definition  of  an  EVENTUALLY  operator: 

⊢def EVENTUALLY d t1 = ∃ t2. t2 > t1 ∧ d t2

When applied to the signal d and the current time t1, the predicate states that there exists
a time t2 in the future when the signal d will be true. Existential quantification
over time is also used to specify the behavior of asynchronous interconnections between
devices. Joyce [9] has shown how temporal logic can be embedded in higher-order logic.
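A bounded analogue of the EVENTUALLY operator can be sketched in Python as follows. The `horizon` argument is our assumption: the logical definition quantifies over all future times, whereas executable code can only search a finite range. The signal `pulse` is hypothetical.

```python
def eventually(d, t1, horizon):
    """Bounded EVENTUALLY: some t2 with t1 < t2 <= horizon has d(t2) true."""
    return any(d(t2) for t2 in range(t1 + 1, horizon + 1))

def pulse(t):
    """Hypothetical signal that is high only at time 7."""
    return t == 7
```

Note that the inequality is strict, as in the definition: a signal that is high only at the current time does not satisfy the predicate.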

2.3  Behavioral  Abstraction  and  Specification. 

There are many ways of specifying the same circuit. For example, in specifying a two-input
binary decoder, one might write:

⊢def decoder_spec s0 s1 o0 o1 o2 o3 =
    (o0 = (s1 → (s0 → F | F) | (s0 → F | T))) ∧
    (o1 = (s1 → (s0 → F | F) | (s0 → T | F))) ∧
    (o2 = (s1 → (s0 → F | T) | (s0 → F | F))) ∧
    (o3 = (s1 → (s0 → T | F) | (s0 → F | F)))

While  this  specification  is  correct,  its  meaning  is  not  very  clear. 

Here  is  another  specification  for  the  same  behavior: 

⊢def decoder_spec s0 s1 o0 o1 o2 o3 =
    (o0 = (¬s1 ∧ ¬s0)) ∧
    (o1 = (¬s1 ∧ s0)) ∧
    (o2 = (s1 ∧ ¬s0)) ∧
    (o3 = (s1 ∧ s0))

This  specification  closely  models  one  possible  implementation  for  the  circuit;  consequently, 
using  it  as  the  behavioral  specification  would  make  the  verification  easier,  but  would  not 
tell  us  much  about  the  abstract  behavior  of  the  decoder. 

The  next  specification  is  more  abstract  and  says  more  about  the  behavior  of  the  decoder: 

⊢def decoder_spec s0 s1 o0 o1 o2 o3 =
    (o0 ↔ ((s1, s0) = (F, F))) ∧
    (o1 ↔ ((s1, s0) = (F, T))) ∧
    (o2 ↔ ((s1, s0) = (T, F))) ∧
    (o3 ↔ ((s1, s0) = (T, T)))

This specification clearly shows the binary numbers being represented by the inputs. Moreover,
the specification does not suggest any particular implementation. In general, the more
abstract a specification, the easier it is to understand, but the more difficult it is to verify.

We  can  make  the  above  specification  even  more  abstract  by  defining  a function,  pairval, 
that  converts  boolean  pairs  into  numbers  and  then  writing  the  specification  as  follows. 




⊢def decoder_spec s0 s1 o0 o1 o2 o3 =
    let n = pairval(s1, s0) in
    (o0 ↔ (n = 0)) ∧
    (o1 ↔ (n = 1)) ∧
    (o2 ↔ (n = 2)) ∧
    (o3 ↔ (n = 3))

This specification can be readily generalized to have n inputs and 2^n outputs.
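That the gate-like and the numeric decoder specifications describe the same behavior can be checked mechanically for this small case. The sketch below is in Python, not HOL; it enumerates all 64 assignments to the six lines, and it adds one possible reading of the generalization to n select lines. The names `spec_gates`, `spec_numeric`, and `spec_n` are ours, and the bit ordering in `spec_n` (element 0 is the low-order bit) is an assumption.

```python
from itertools import product

BOOLS = (False, True)

def pairval(s1, s0):
    """Convert the Boolean pair (s1, s0) to the number it encodes; s1 is the high bit."""
    return 2 * int(s1) + int(s0)

def spec_gates(s0, s1, o0, o1, o2, o3):
    """Gate-like specification: each output equals a conjunction of literals."""
    return (o0 == ((not s1) and (not s0)) and
            o1 == ((not s1) and s0) and
            o2 == (s1 and (not s0)) and
            o3 == (s1 and s0))

def spec_numeric(s0, s1, o0, o1, o2, o3):
    """Abstract specification: output k is high exactly when the inputs encode k."""
    n = pairval(s1, s0)
    return (o0, o1, o2, o3) == tuple(n == k for k in range(4))

# The two predicates agree on every assignment to the six lines.
specs_agree = all(
    spec_gates(*v) == spec_numeric(*v) for v in product(BOOLS, repeat=6)
)

def spec_n(select, outputs):
    """Generalized decoder: len(select) inputs and 2 ** len(select) outputs."""
    n = sum(int(bit) << i for i, bit in enumerate(select))  # select[0] is the low bit
    return list(outputs) == [n == k for k in range(2 ** len(select))]
```

Enumeration is feasible only because the circuit is tiny; the number of cases doubles with every added line.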


2.4  Specifying  a Microprocessor 

So  far,  the  circuits  we  have  described  have  been  simple,  for  expository  purposes.  One 
should  not  assume  that  all  specifications  must  be  of  small  devices.  Indeed,  logic  is  most 
useful  when  used  on  large,  abstract  specifications.  To  demonstrate  the  use  of  formal 
specification  on  a larger  example,  we  will  present  the  specification  of  a small  microprocessor 
called  Tamarack. 

There  have  been  numerous  efforts  to  verify  microprocessors  [4,8,6].  Most  of  these  have 
used  the  same  implicit  behavioral  model.  In  general,  the  model  uses  a state  transition 
system  to  describe  the  microprocessor.  A microprocessor  specification  has  four  important 
parts: 

1. A representation of the state, S. This representation varies depending on the
verification system being used.

2. A set of state transition functions, J, denoting the behavior of the individual
instructions of the microprocessor. Each of these functions takes the state defined in step
(1) as an argument and returns the state updated in some meaningful way.

3. A selection function, N, that selects a function from the set J according to the
current state.

4. A predicate, I, relating the state at time t + 1 to the state at time t by means of J
and N.

In  some  cases,  the  individual  state  transition  functions,  J,  and  the  selection  function,  N, 
are  combined  to  form  one  large  state  transition  function. 

To make all of this more concrete, consider the top-level specification of Tamarack
presented by Joyce in [9].

⊢def TamarackBeh (ireq, mem, pc, acc, rtn, iack) =
    ∀ t : time.
      (mem (t+1), pc (t+1), acc (t+1), rtn (t+1), iack (t+1)) =
        NextState (ireq t, mem t, pc t, acc t, rtn t, iack t)


The  top-level  specification  relates  the  state  of  the  assembly  language  level  registers  at  time 
t + 1 to  their  state  at  time  t using  the  function  NextState.  The  level  of  abstraction  in  the 
top-level specification is roughly that found in an assembly language reference manual.
The  difference  is  that  the  formal  specification  is  less  ambiguous  and  more  complete. 

The next state function chooses among the many individual instructions according to
a selection criterion that describes, in an abstract way, instruction decoding:

⊢def NextState (ireq, mem, pc, acc, rtn, iack) =
    let opcval = OpcVal (mem, pc) in
    ((ireq ∧ ¬iack)     → IRQ_SEM (mem, pc, acc, rtn, iack) |
     (opcval = JZR_OPC) → JZR_SEM (mem, pc, acc, rtn, iack) |
     (opcval = JMP_OPC) → JMP_SEM (mem, pc, acc, rtn, iack) |
     (opcval = ADD_OPC) → ADD_SEM (mem, pc, acc, rtn, iack) |
     (opcval = SUB_OPC) → SUB_SEM (mem, pc, acc, rtn, iack) |
     (opcval = LDA_OPC) → LDA_SEM (mem, pc, acc, rtn, iack) |
     (opcval = STA_OPC) → STA_SEM (mem, pc, acc, rtn, iack) |
     (opcval = RFI_OPC) → RFI_SEM (mem, pc, acc, rtn, iack) |
                          NOP_SEM (mem, pc, acc, rtn, iack))

Each of the instructions available to the programmer, as well as each action that takes place
on an instruction boundary (such as an interrupt), is defined using a function on the state and
environment variables that returns a new state updated as appropriate for the instruction
being specified. We use the ADD instruction as an example:

⊢def ADD_SEM (mem:*memory, pc:*wordn, acc:*wordn, rtn:*wordn, iack:bool) =
    let inst = fetch (mem, (address pc)) in
    let operand = fetch (mem, (address inst)) in
    (mem, inc pc, add (acc, operand), rtn, iack)


This instruction increments the program counter and stores in the accumulator the result
of adding the accumulator to the contents of the memory location addressed by the current
instruction. No other state changes occur.
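The four-part model (state, transition functions, selection function, and next-state predicate) can be illustrated with a toy interpreter. The following Python sketch is not the Tamarack specification: the state is reduced to memory, program counter, and accumulator; interrupts, `rtn`, and `iack` are omitted; word-size wraparound is ignored; and the address width, the opcode encoding, and the behavior of the no-op are assumptions made purely for illustration.

```python
from typing import NamedTuple, Tuple

ADDR_BITS = 4   # assumed address width; not taken from the Tamarack design
ADD_OPC = 1     # hypothetical opcode encoding

class State(NamedTuple):
    mem: Tuple[int, ...]   # memory contents, indexed by address
    pc: int                # program counter
    acc: int               # accumulator

def address(word):
    """Low-order bits of a word, interpreted as a memory address."""
    return word % (1 << ADDR_BITS)

def opc_val(state):
    """Opcode field (high-order bits) of the word the program counter points to."""
    return state.mem[address(state.pc)] >> ADDR_BITS

def add_sem(state):
    """ADD: acc := acc + mem[operand address]; pc := pc + 1; memory unchanged."""
    inst = state.mem[address(state.pc)]
    operand = state.mem[address(inst)]
    return state._replace(pc=state.pc + 1, acc=state.acc + operand)

def nop_sem(state):
    """Assumed no-op: only the program counter advances."""
    return state._replace(pc=state.pc + 1)

def next_state(state):
    """Selection function: pick the semantic function for the current opcode."""
    semantics = {ADD_OPC: add_sem}
    return semantics.get(opc_val(state), nop_sem)(state)

def behaves(trace):
    """Bounded form of the top-level predicate: each state follows from the previous."""
    return all(trace[t + 1] == next_state(trace[t]) for t in range(len(trace) - 1))
```

As in the logical specification, `behaves` is a predicate over a whole trace rather than a simulator: it accepts or rejects a proposed sequence of states.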

There  are  at  least  three  kinds  of  abstraction  taking  place  between  the  register  transfer 
level  (RTL)  description  of  Tamarack  and  the  top-level  specification  given  above. 

1. Behavioral Abstraction: The RTL description of Tamarack is a structural model
that says how the major blocks are connected. The top-level specification says
nothing about the structure of the microprocessor, but rather describes the required
behavior.

2. Data Abstraction: The RTL description contains registers that are not of interest
in the top-level specification. A good example is the instruction register, which is
vital to the correct functioning of the microprocessor but is not considered in the
top-level specification.

3. Temporal Abstraction: Events at the RTL happen at a much finer time
granularity than events at the top level. Events at the top level are measured on a
time scale that coincides with the execution of macro-level instructions. Events at
the RTL are measured by the sub-cycle clock. Many RTL events must
take place to cause one top-level event to happen.

3 Using Proof to Analyze Specifications

Proof can be used in at least two ways to analyze specifications. The first asks
the question "Is my specification correct?" The second asks the question "Does my
implementation meet the specification?"

3.1  Design  Verification 

Determining whether or not a specification is correct is not a question that can be subjected
to exhaustive mathematical analysis since the design is an intellectual artifact, not a
mathematical one. We can, however, determine whether a specification meets its requirements
to the extent that those requirements can be formulated in logic.

An  example  of  this  is  the  verification  of  two  important  properties  of  the  supervisory 
mode  of  a microprocessor  called  AVM-1  [17].  AVM-1  has  a supervisory  mode  that  is 
controlled  by  the  supervisory  mode  bit  in  a register  called  the  program  status  word  (PSW). 
When  the  processor  is  in  supervisory  mode,  certain  registers  in  the  register  file  (which  does 
not  include  the  PSW)  become  writable.  Otherwise  they  can  only  be  read. 

One  of  the  design  requirements  can  be  stated  informally  as  follows: 

Property 1 (Integrity of Privileged Registers) If the CPU is not in supervisory mode
and the next instruction is not an external or user-generated interrupt, then every
privileged register remains unchanged.

The  integrity  of  the  privileged  registers  is  only  important  at  the  assembly  language 
programmer’s  level  of  the  CPU.  We  do  not  care  if  the  registers  change  on  a finer  time  scale 
so  long  as  they  remain  the  same  when  viewed  by  the  outside  world. 

The  formalization  of  this  requirement  is  not  difficult.  The  following  expression  captures 
the  essence  of  the  problem: 

∀ n. (IS_SUP_REG n) ⇒
    ((EL n (macro_reg (t + 1))) =
     (EL n (macro_reg t)))


The expression states that every supervisory mode register in the register file (represented
by a list) has the same value at time t + 1 as it had at time t.¹

The basic requirement, stated above, must follow from the definition of the top-level
of AVM-1 (AVM_Beh) and is subject to the following conditions:

1. The CPU is not currently in supervisory mode (expressed as ¬get_sm (psw t)).

2. The next instruction is not an internal or external interrupt (expressed in the
specification as ¬(Opcode ... = INT_OPCODE) and ¬(Opcode ... = EINT_OPCODE)).


¹ EL selects the nth member of a list.

⊢ AVM_Beh
    (λ t. (reg_list t, psw t, pc t, mem t, ivec t))
    (λ t. (ireq_e t)) ⇒
  (∀ t.
    ¬get_sm (psw t) ∧
    ¬(Opcode (reg_list t, psw t, pc t, mem t, ivec t)
             (ireq_e t) = INT_OPCODE) ∧
    ¬(Opcode (reg_list t, psw t, pc t, mem t, ivec t)
             (ireq_e t) = EINT_OPCODE) ⇒
    (∀ n. IS_SUP_REG n ⇒
      (EL n (reg_list (t + 1)) = EL n (reg_list t))))


This  theorem  is  not  difficult  to  establish  and,  when  combined  with  a correctness  proof  (see 
Section  3.2),  gives  confidence  that  the  supervisory  mode  works  as  it  should. 


3.2  Functional  Verification 

A second, and complementary, use of proof is in showing that our specification is correctly
implemented by the structure that we have chosen for the RTL model.

A simple  example  is  given  by  the  circuit  D specified  in  Section  2.1.  To  show  that  the 
implementation  (represented  by  D_imp)  meets  its  specification  (represented  by  D),  we  prove 
the  following  theorem: 


⊢ ∀ a b c d out. D_imp(a, b, c, d, out) ⇒ D(a, b, c, d, out)


This  theorem  could  be  proven  using  any  number  of  techniques.  Indeed,  while  it  is  a simple 
example,  it  has  little  to  do  with  the  kinds  of  proofs  of  correctness  that  occur  most  frequently 
or  that  are  the  most  interesting. 
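One of the simplest techniques for a combinational circuit of this size is exhaustive case analysis. The Python sketch below (our illustration, not the HOL proof) enumerates all 32 assignments to the five external lines and checks that the structural predicate implies the behavioral one. For this particular circuit the two predicates are in fact equivalent, which the sketch also checks. The function names are ours.

```python
from itertools import product

BOOLS = (False, True)

def d_spec(a, b, c, d, out):
    """Behavioral predicate D: out = (a AND b) OR (c AND d)."""
    return out == ((a and b) or (c and d))

def d_imp(a, b, c, d, out):
    """Structural predicate D_imp, with the internal lines p and q hidden."""
    return any(
        p == (a and b) and q == (c and d) and out == (p or q)
        for p, q in product(BOOLS, repeat=2)
    )

ASSIGNMENTS = list(product(BOOLS, repeat=5))

# Implementation meets specification: D_imp implies D on every assignment.
implication_holds = all((not d_imp(*v)) or d_spec(*v) for v in ASSIGNMENTS)

# For this circuit the converse also holds.
equivalent = all(d_imp(*v) == d_spec(*v) for v in ASSIGNMENTS)
```

Case analysis of this kind does not scale to designs with state or wide data paths, which is where the abstraction techniques discussed next become necessary.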

A more  interesting  example  is  given  in  the  proof  of  correctness  of  Tamarack  [9]  since 
the  proof  involves  behavioral,  data,  and  temporal  abstraction.  We  have  already  seen  the 
specification  of  the  top-level  of  Tamarack  (see  Section  2.4).  The  RTL  model  is  a fairly 
large,  but  conventional  description  of  the  large  grain  structure  of  the  microprocessor. 

In order to understand the correctness theorem, we must describe the temporal abstraction
that takes place between the RTL model and the top-level behavioral description. As
we have already mentioned, different levels in the specification have different views of time.
We use temporal abstraction to produce a function that maps time at one level to time at
another. Figure 3 shows a temporal abstraction function F. The circles represent clock
ticks. Note that the number of clock ticks required at the bottom level to produce one
clock tick at the top level is irregular.

The predicate G is true whenever there is a valid abstraction from the lower level
to the upper level. We can define a generic temporal abstraction function in terms of G.
In a microprocessor specification, G is usually a predicate indicating when the lower level
machine is at the beginning of its cycle, a condition that is easy to test.

Figure 3: The function F, which maps time at one level to another, can
be defined in terms of a predicate, G, which is true only when
the mapping occurs.


We will use a function TimeOf as our temporal abstraction function. The function is
defined recursively so that (TimeOf g 0) is the first time that the predicate g is true and
(TimeOf g (n+1)) is the next time after (TimeOf g n) when g is true. We will not develop the
details of the temporal abstraction function here, but refer the interested reader to [9].
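The recursive shape of such a function can be sketched in Python. This is our approximation, not Joyce's definition: the logical TimeOf is defined without any search bound, whereas the `limit` parameter here is an assumption that keeps the search finite. The helper `abstract` shows how a low-level signal is sampled at the abstracted times, which corresponds to composing the signal with the time mapping.

```python
def time_of(g, n, limit=10_000):
    """Time at which predicate g is true for the (n+1)-th time, counting from time 0."""
    # Start at 0 for the first occurrence, otherwise just after the previous one.
    t = 0 if n == 0 else time_of(g, n - 1, limit) + 1
    while not g(t):
        t += 1
        if t > limit:
            raise ValueError("g is not true often enough within the search limit")
    return t

def abstract(signal, g):
    """View a low-level signal at the coarser time scale selected by g."""
    return lambda n: signal(time_of(g, n))

# Hypothetical predicate: the low-level machine starts a cycle at times 2, 3 and 7.
def cycle_start(t):
    return t in (2, 3, 7)
```

The irregular spacing of the times at which `cycle_start` holds mirrors the irregular number of low-level clock ticks per top-level tick.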

The  final  correctness  theorem  for  Tamarack  states  that  the  behavioral  model  (defined 
by  TamarackBeh)  follows  from  a system  (AsynSystem)  composed  of  the  RTL  model  and 
an  asynchronous  memory  subsystem. 

⊢ AsynSystem (idreq, mpc, mar, pc, acc, ir, rtn, arg, buf, idack, dack, mem) ∧
  ((val4 o mpc) 0 = 0)
  ⇒
  let f = TimeOf ((((val4 rep) o mpc) Eq 0) and (not dack)) in
  TamarackBeh (idreq o f, mem o f, pc o f, acc o f, rtn o f, idack o f)


The function f is the function F of Figure 3. We also have a reset condition that requires
that the value of the microprogram counter, mpc, be 0 at time 0.

Presenting  the  proof  of  the  correctness  theorem  for  Tamarack  is  beyond  the  scope  of  this 
paper.  The  proof  is  actually  quite  straightforward  in  most  cases,  involving  standard  proof 
techniques  such  as  substitution,  case  analysis,  and  induction.  Indeed,  much  of  the  difficulty 
is  caused  by  the  size  of  the  proof  effort  rather  than  the  puzzling  nature  of  the  theorems. 
Tamarack  is,  of  course,  far  from  being  the  largest  device  with  a verified  correctness.  Recent 
research  has  developed  techniques  for  managing  much  of  the  complexity  of  proofs  of  this 
sort [16]. The techniques are demonstrated in the proof of correctness of AVM-1.

One should not, of course, accept that the microprocessor is correct simply because
there is a theorem. The idea is that proof constitutes engineering analysis and, like any
engineering analysis, must be documented and subject to review. What we have presented
here is not, of course, an engineering analysis.

4 Conclusions

This  paper  has  shown  how  logic  can  be  used  to  specify  and  analyze  hardware  designs.  The 
use  of  formal  methods  has  a number  of  advantages. 

• Specifications  give  a clear  and  precise  statement  of  the  intended  behavior  of  a design. 

• Specifications can be analyzed to determine whether or not they meet the
requirements of the design.

• Functional  correctness  can  be  demonstrated  through  analysis  rather  than  testing. 

• Assumptions  are  made  explicit. 

We do not suggest that formal methods replace conventional engineering practices,
but that they augment them. Work is continuing to bring tools based on formal methods
into the designer's toolbox:

• We  are  developing  new  high-level  models  of  common  hardware  devices  which  guide 
the  specification  and  verification  of  those  devices. 

• We are writing translators between hardware description languages used by
conventional CAD tools and verification tools.

• We  are  doing  case  studies  to  serve  as  examples  of  specification  and  verification. 

These efforts, and similar efforts at other institutions, promise to make formal methods
tractable for large-scale use in VLSI design.


References 

[1]  J.  R.  Armstrong.  Chip-Level  Modeling  with  VHDL.  Prentice  Hall,  1989. 

[2]  A.  Camilleri,  M.  Gordon,  and  T.  Melham.  Hardware  verification  using  higher  order 
logic.  In  D.  Borrione,  editor,  From  HDL  Descriptions  to  Guaranteed  Correct  Circuit 
Designs.  Elsevier  Scientific  Publishers,  1987. 

[3] W. F. Clocksin. Logic programming and digital circuit analysis. The Journal of Logic
Programming, 4:59-82, 1987.

[4] A. Cohn. Correctness properties of the VIPER block model: The second level.
Technical Report 134, University of Cambridge Computer Laboratory, May 1988.

[5] M. J. Gordon. Why higher-order logic is a good formalism for specifying and verifying
hardware. In G. J. Milne and P. A. Subrahmanyam, editors, Formal Aspects of VLSI
Design, pages 153-177. Elsevier Scientific Publishers, 1986.




[6] W. A. Hunt. The mechanical verification of a microprocessor design. In D. Borrione,
editor, From HDL Descriptions to Guaranteed Correct Circuit Designs. Elsevier
Scientific Publishers, 1987.

[7] IEEE Std 1076-1987. IEEE Standard VHDL Language Reference Manual, 1987.

[8] J. J. Joyce. Formal verification and implementation of a microprocessor. In
G. Birtwhistle and P. Subrahmanyam, editors, VLSI Specification, Verification, and
Synthesis. Kluwer Academic Press, 1988.

[9] J. J. Joyce. Multi-Level Verification of Microprocessor-Based Systems. PhD thesis,
Cambridge University, December 1989.

[10] J. J. Joyce. More reasons why higher-order logic is a good formalism for specifying
and verifying hardware. In Proceedings of the ACM/SIGDA International Workshop
in Formal Methods in VLSI Design, January 1991.

[11] K. Keutzer. Panel discussion: Model checking, theorem proving, and CAD. In
ACM/SIGDA International Workshop in Formal Methods in VLSI Design, January
1991.

[12] T. Melham. Abstraction mechanisms for hardware verification. In G. Birtwhistle and
P. A. Subrahmanyam, editors, VLSI Specification, Verification and Synthesis. Kluwer
Academic Publishers, 1988.

[13] C. Pygott. Noden_HDL: an engineering approach to hardware verification. In
G. Milne, editor, The Fusion of Hardware Design and Verification. Elsevier Science
Publ. B.V.IFIP, 1988.

[14] J. P. V. Tassel. The semantics of VHDL with VAL and HOL: Towards practical
verification tools. Master's thesis, Department of Computer Science and Engineering,
Wright State University, 1989.

[15] J. P. V. Tassel and D. Hemmendinger. Toward formal verification of VHDL
specifications. In L. Claesen, editor, Applied Formal Methods for Correct VLSI Design,
Leuven, Belgium, November 1989. Elsevier Science Publishers.

[16] P. J. Windley. The Formal Verification of Generic Interpreters. PhD thesis,
University of California, Davis, Division of Computer Science, June 1990.

[17] P. J. Windley. Using correctness results to verify behavioral properties of
microprocessors. In Proceedings of the IEEE Computer Assurance Conference, June 1991.


3rd  NASA  Symposium  on  VLSI  Design  1991 


N 94-18384 




A New  Variable  Testability  Measure 

M.  Jamoussi,  B.  Kaminska,  D.  Mukhedkar 
Department  of  Electrical  and  Computer  Engineering 
Ecole  Polytechnique  de  Montreal 
P.O.Box  6079,  Station  A 
Montreal,  Canada  (H3C  3A7) 

Abstract - In this paper, we propose a new Variable Testability Measure (VTM)
for implementing testability at the high-level synthesis stage of the design
process of integrated circuits. This new approach is based on binary decision
diagrams, which fully represent the functional blocks of a circuit, and on their
cyclomatic testability measures. It manipulates dataflow blocks to predict whether
the circuit is testable and the vector set required to test it.


1 Introduction 

In recent years, the use of silicon compilation and other standard cell design tools has
changed the way digital systems are designed. As a result, more and more systems are being
designed at the functional level, with little gate-level design being explicitly performed.
An appropriate measure is therefore needed that efficiently represents knowledge about
functional-level testability.

This paper addresses an approach for implementing testability in the high-level synthesis
process, and particularly at the functional-level stage. A new Variable Testability
Measure (VTM) is then defined and used to evaluate the dataflow testability in an advanced
step of the design process.

2 Variable  Testability  Measure 

We  are  interested  in  the  testability  of  integrated  circuits  as  early  as  possible  in  their  design 
process. 

Usually, testability implementation is left until after the design is completed, which
requires greater effort at later stages. Testability analysis tools, such as SCOAP [7], which
are supposed to support testability incorporation during the design stages, actually provide
poor predictions of testability and do not suggest which test methodologies should be
applied or how.


2.1  High-Level  Synthesis 

Synthesis involves finding a structure that implements the behavior, the constraints and
the goals of a given system. Generally, synthesis may be considered at various levels
of abstraction because designs can be described at various levels of detail. High-level
synthesis [9] is the type of synthesis that begins at what is often called the algorithmic level.
It takes an abstract behavioral specification of a digital system and finds a
register-transfer-level structure to realize the given behavior.

The system to be designed is usually represented at the algorithmic level by a programming
language such as PASCAL [10] or ADA [6], or by a hardware description language
similar to a programming one, such as VHDL [8]. The first step in high-level synthesis
is usually the compilation of the formal language describing the system behavior into an
internal representation.

The next two steps in synthesis transform core behavior into structure: scheduling
and allocation. They are closely interrelated and interdependent. Scheduling consists
of assigning the operations to so-called control steps; a control step is the fundamental
sequencing unit in synchronous systems and corresponds to a clock cycle. Allocation consists
of assigning the operations to hardware. Finally, the design has to be converted into real
hardware. Lower-level tools such as logic synthesis and layout synthesis complete the design.

The advantages of implementing testability as early as possible in the design process are
a testable design, a reduced test cost and an earlier detection of untestable blocks. To
incorporate testability in high-level synthesis as a constraint of the design specifications,
a new concept, the Variable Testability Measure, is introduced. It indicates whether the
circuit is testable and permits a good prediction of the test-vector set for a graph
representation of a circuit.

2.2  Variable  Testability  Measure

A new method for gathering testability information at the dataflow-design stage, called
the Variable Testability Measure (VTM), is introduced. It permits an easy propagation of
the information about functional-block testability and indicates that testability problems
can be dealt with easily during high-level synthesis.

Using a hierarchical abstraction principle, VTM will be able to provide two kinds of
information: first, whether or not each subcircuit and then the whole circuit are testable;
second, the minimum number of test vectors required for a specific block of the circuit,
and then for the whole circuit. This approach is based on the binary decision diagram [1]
and the cyclomatic testability measure [2].

The Binary Decision Diagram (BDD) is a method for defining, analyzing, testing, and
implementing large digital functions. It provides a complete implementation-free
description of the functions involved. One of the areas in which these diagrams can be
particularly useful is that of test generation, i.e. finding a set of inputs able to confirm
that a given implementation performs correctly. Finally, BDDs may be directly interconnected
to define still larger functions.

The Cyclomatic Testability Measure (CTM) is a method for predicting and determining
a minimum set of test vectors for graphs of combinational and sequential circuits in the
early conceptual stage of the design process. This approach is based on the cyclomatic
number [3] and the BDD. If we denote by V the CTM of a given BDD called G, V is computed
by the following equation:

$$V(G) = e - n + 3 \qquad (1)$$

where  e and  n are  the  numbers  of  edges  and  nodes  in  the  graph  G respectively. 
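As a small worked instance of equation (1), the Python sketch below computes the measure for a graph given as a list of directed edges. The edge-list representation and the example diagrams are our assumptions for illustration; they are not taken from the paper. The first example is a BDD for the two-input AND function, with decision nodes `a` and `b` and terminal nodes `0` and `1`.

```python
def ctm(edges):
    """Cyclomatic testability measure V(G) = e - n + 3 of a graph given as an edge list."""
    nodes = {vertex for edge in edges for vertex in edge}
    return len(edges) - len(nodes) + 3

# Hypothetical BDD for f = a AND b: a=0 leads to terminal 0, a=1 leads to node b.
bdd_and = [("a", "0"), ("a", "b"), ("b", "0"), ("b", "1")]

# Hypothetical BDD for the single-variable function f = a.
bdd_var = [("a", "0"), ("a", "1")]
```

For `bdd_and` there are 4 edges and 4 nodes, so V = 4 - 4 + 3 = 3, which in this small case coincides with the number of root-to-terminal paths in the diagram.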

Finally,  it  was  shown  that  the  CTM,  called  V,  of  a circuit  composed  of  p subcircuits 
and  represented  by  a BDD  called  G,  is  computed  by  the  following  equation: 

V(G)  = V(Gi)  + £(V(G.)  - 2)  (2) 

1=2 

where  V{Gi)  is  the  CTM  of  the  subcircuit  containing  the  entry  node  represented  by  the 
BDD,  called  <7a.  V(Gt)  corresponds  to  the  ith  subcircuit  represented  by  its  BDD,  called 

Gi,  (i  = 2,  ...,p). 
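As a minimal sketch of how equations (1) and (2) combine, the following Python fragment evaluates both. The function names and the edge, node and subcircuit counts are hypothetical and chosen only for illustration; they are not taken from a figure in this paper.

```python
def ctm(num_edges: int, num_nodes: int) -> int:
    """Equation (1): V(G) = e - n + 3 for a BDD with e edges and n nodes."""
    return num_edges - num_nodes + 3


def composed_ctm(entry_ctm: int, other_ctms: list) -> int:
    """Equation (2): V(G) = V(G_1) + sum over i = 2..p of (V(G_i) - 2).

    entry_ctm is V(G_1), the CTM of the subcircuit holding the entry node;
    other_ctms holds V(G_2), ..., V(G_p).
    """
    return entry_ctm + sum(v - 2 for v in other_ctms)


# Hypothetical BDD with 6 edges and 5 nodes.
v1 = ctm(6, 5)

# Hypothetical composition: the entry subcircuit above is fed by two
# subcircuits whose CTMs are 3 and 5.
v_total = composed_ctm(v1, [3, 5])
print(v1, v_total)
```

Note that a subcircuit with CTM 2 contributes nothing to the sum in equation (2), which is consistent with the value 2 assigned later to primary inputs.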

2.3  Definition  of  VTM 

The concept of VTM is a further development of the Cyclomatic Testability Measure. The VTM is defined as follows: consider a functional block having its input and output variables given on n bits, such as an adder or a multiplier. VTM is a coefficient assigned to each bit of the input and output variables of this functional block. The VTM of a bit is the minimum number of test vectors required to test it. In effect, a variable given on n bits has n different VTMs, one for each bit.

The VTM permits treatment of functional blocks having their inputs and outputs given on various numbers of bits, while the CTM treats blocks whose single output and inputs are each given on one bit.

The advantage of the VTM is the easy composition of various blocks, which permits testability to be incorporated in the dataflow of the synthesis process.

3 Use  of  VTM  in  High-Level  Synthesis 

In this section, we apply this new measure to some common functional primitives. We then show how the measure is involved in evaluating the testability of a circuit at its high-level synthesis stage.


3.1  Computation  of  VTM  for  some  functional  blocks 

We determine the VTMs of the outputs of some functional-level circuit primitives, such as comparators, adders, multipliers, logical operators and multiplexors, which basically form the data flow of a circuit. Three cases are discussed in this section.

For the functional primitives studied, we denote by A and B the inputs of these blocks, each given on n bits: $A_n A_{n-1} \ldots A_i \ldots A_2 A_1$ and $B_n B_{n-1} \ldots B_i \ldots B_2 B_1$ respectively. This notation means that $A_n, A_{n-1}, \ldots, A_2, A_1$ and $B_n, B_{n-1}, \ldots, B_2, B_1$ are the binary encodings of A and B respectively. We also denote by $a_1, a_2, \ldots, a_i, \ldots, a_n$ and $b_1, b_2, \ldots, b_i, \ldots, b_n$ the VTMs of the bits of A and B respectively.

3.2 A Comparator

Let us consider the case of a comparator of two variables A and B, given on n bits. The logical value of the output S is 1 when A is greater than B (A > B). If not, the value of S is 0.

In figure 1, we present the BDD describing this functional block. According to equations (1) and (2), the VTM of S, noted s, is:

$$s = (4n + 3) - (a_n + b_n) + 2 \sum_{i=1}^{n} (a_i + b_i - 4) \qquad (3)$$

So the comparator of two variables, each given on n bits, is tested with a minimum number of test vectors found by equation (3).

Figure 1: BDD of a Comparator


3.3  A Multiplexor 

Let us now study the case of a multiplexor of two variables. The inputs A and B are given on n bits, while the control signal C is given on one bit. The output S, which is $S_n S_{n-1} \ldots S_i \ldots S_2 S_1$, is equal to A if C is true; otherwise it is equal to B.

The multiplexor can be considered in this case as a set of elementary multiplexors where the ith bit of the output is given as follows (i = 1, ..., n):

$$S_i = C \cdot A_i + \bar{C} \cdot B_i \qquad (4)$$

The CTM of the BDD describing $S_i$ is equal to 4. Then, supposing $a_i, b_i, c, s_i$ the VTMs of $A_i, B_i, C, S_i$ respectively, according to equations (1) and (2), $s_i$ is computed as follows:

$$s_i = 4 + (a_i - 2) + (b_i - 2) + (c - 2) = a_i + b_i + c - 2 \qquad (5)$$

Finally, if A and B are primary inputs, which means $a_i = b_i = 2$ (i = 1, ..., n), according to (5), $s_i$ is given as follows:

$$s_i = c + 2 \qquad (6)$$

3.4  An  (n,n)  Adder 

Let us now consider an adder of two variables A and B given on n bits, called an (n,n) adder. The sum S is given on (n + 1) bits and is $S_{n+1} S_n \ldots S_i \ldots S_2 S_1$. Considering the elementary full-adder, we have:

$$S_i = A_i \oplus B_i \oplus R_{i-1} \qquad (7)$$

$$R_i = M(A_i, B_i, R_{i-1}) \qquad (8)$$

where $S_i$ is the ith bit of the sum and $R_i$ the carry out of this addition. The CTMs of the BDDs describing $S_i$ and $R_i$ are 7 and 5 respectively. Then, if $s_i$, $r_{i-1}$ and $r_i$ are the VTMs of $S_i$, $R_{i-1}$ and $R_i$ respectively, according to equation (2) $s_i$ and $r_i$ are computed as follows (i = 2, ..., n-1, n):

$$s_i = 7 + (a_i - 2) + (b_i - 2) + (r_{i-1} - 2) \qquad (9)$$

$$r_i = 5 + (a_i - 2) + (b_i - 2) + (r_{i-1} - 2) \qquad (10)$$

In the case of the half-adder, we have:

$$S_1 = A_1 \oplus B_1 \qquad (11)$$

$$R_1 = M(A_1, B_1, 0) \qquad (12)$$

Then, supposing $s_1$ and $r_1$ the VTMs of $S_1$ and $R_1$ respectively, $s_1$ and $r_1$ are computed as follows:

$$s_1 = 4 + (a_1 - 2) + (b_1 - 2) \qquad (13)$$

$$r_1 = 3 + (a_1 - 2) + (b_1 - 2) \qquad (14)$$

Noting $a_p$ and $b_p$ the VTMs of $A_p$ and $B_p$ respectively (p = 1, ..., i), the VTM $s_i$ of $S_i$, the ith bit of S, is:

$$s_i = \sum_{p=1}^{i} (a_p + b_p) + (2 - i) \qquad (15)$$

OPERATOR          K (2 bits)    K (3 bits)
Addition              12            23
Subtraction           12            23
Comparator             7            11
Multiplexor            8            12
Multiplication         9            24
AND / OR               3             3

Table 1: Operator-Cost Coefficients
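The adder recurrences of section 3.4 can be unrolled mechanically. The sketch below is an illustration only: the function names are hypothetical, and the constants 7, 5, 4 and 3 are the CTMs quoted for equations (9), (10), (13) and (14). It computes the sum-bit VTMs bit by bit and compares them with the closed form (15). Under these assumptions the closed form agrees with the recurrence for i = 2, ..., n, while bit 1 is given by the half-adder equation (13).

```python
def adder_sum_vtms(a, b):
    """VTMs s_1..s_n of the sum bits of an (n,n) adder.

    a[i-1] and b[i-1] are the VTMs of input bits A_i and B_i.
    """
    s = [4 + (a[0] - 2) + (b[0] - 2)]    # eq. (13), half-adder sum
    r = 3 + (a[0] - 2) + (b[0] - 2)      # eq. (14), half-adder carry
    for i in range(1, len(a)):
        s.append(7 + (a[i] - 2) + (b[i] - 2) + (r - 2))   # eq. (9)
        r = 5 + (a[i] - 2) + (b[i] - 2) + (r - 2)         # eq. (10)
    return s


def closed_form(a, b, i):
    """Equation (15) for bit i (1-based)."""
    return sum(a[p] + b[p] for p in range(i)) + (2 - i)


# Primary inputs have VTM 2 on every bit.
a = b = [2, 2, 2]
s = adder_sum_vtms(a, b)
print(s)
print([closed_form(a, b, i) for i in range(2, len(a) + 1)])
```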

3.5  Objective  function 

It is common practice to use a cost function in high-level synthesis that considers delay, area or testability constraints [5]. In this work, we propose a new objective function able to estimate and evaluate, in particular, the testability and area constraints of a circuit. As a first stage, we define a coefficient K for each functional primitive as the sum of its output VTMs when it has only primary inputs. The value of K depends essentially on the number of bits of the given functional-primitive variables.
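As a rough check of this definition against Table 1, the sketch below evaluates K for three primitives with primary inputs (VTM 2 on every input bit, including the multiplexor control). The helper names are hypothetical. Two readings are assumptions made here: the comparator uses equation (3) with the sum taken over all n bits, and the adder coefficient sums the VTMs of bits $S_1$ to $S_n$ only, leaving out the carry-out bit, since that reading reproduces the Addition row.

```python
PRIMARY = 2  # VTM of a primary-input bit (section 3.3)


def k_comparator(n):
    """Equation (3) with a_i = b_i = 2; the comparator has a single output."""
    a = b = [PRIMARY] * n
    return (4 * n + 3) - (a[-1] + b[-1]) + 2 * sum(x + y - 4 for x, y in zip(a, b))


def k_multiplexor(n, c=PRIMARY):
    """Sum over the n output bits of s_i = a_i + b_i + c - 2."""
    return sum(PRIMARY + PRIMARY + c - 2 for _ in range(n))


def k_adder(n):
    """s_1 from eq. (13), then eq. (15) for bits 2..n (carry-out bit left out)."""
    s1 = 4 + (PRIMARY - 2) + (PRIMARY - 2)
    rest = [2 * PRIMARY * i + (2 - i) for i in range(2, n + 1)]
    return s1 + sum(rest)


for n in (2, 3):
    print(n, k_adder(n), k_comparator(n), k_multiplexor(n))
```

With these assumptions the computed values match the Addition, Comparator and Multiplexor rows of Table 1 for both 2 and 3 bits.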

In the case of logical operators or specific blocks where variables are given on one bit, such as AND and OR gates, K is the output VTM. Table 1 gives the costs of some common functional blocks operating with variables given on 2 bits and 3 bits respectively. For a given functional primitive, this coefficient is its cost coefficient in the objective function introduced below.

3.6  Definition 

Let us consider a circuit C composed of n connected functional primitives. Given the ith functional block (i = 1, ..., n), let us assume:

$m_i$: the sum of the bit numbers of its outputs.
$a_{i,p}$ (p = 1, ..., $m_i$): the VTMs of this block's outputs.
$K_i$: the cost of this block.

The objective function is defined as follows:

$$f = \sum_{i=1}^{n} K_i \Bigl( \sum_{p=1}^{m_i} a_{i,p} \Bigr) \qquad (16)$$

The function given by equation (16) shows a trade-off between the circuit area (functional primitives used) and the number of test vectors or, in other words, the test time. One of our goals in using this new measure in high-level synthesis can then be expressed as minimizing this objective function.
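A minimal sketch of evaluating the objective function follows. The two-block circuit is hypothetical: the cost coefficients are read from Table 1, and the output VTMs are illustrative values, not taken from figure 2.

```python
def objective(blocks):
    """Equation (16): f = sum over blocks of K_i * (sum of output VTMs a_{i,p}).

    blocks is a list of (K_i, [a_{i,1}, ..., a_{i,m_i}]) pairs.
    """
    return sum(k * sum(vtms) for k, vtms in blocks)


# Hypothetical circuit: one 2-bit adder (K = 12) with output VTMs 4 and 8,
# and one AND gate (K = 3) with output VTM 3.
circuit = [(12, [4, 8]), (3, [3])]
print(objective(circuit))
```

Each term weights the testability of a block's outputs by the block's cost, so replacing a block with a cheaper one or lowering its output VTMs both reduce f.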


Figure 2: (a) Example circuit (b) Example circuit modified

The example presented in figure 2(a) is modified in figure 2(b) without changing the behavior of the circuit. We notice lower VTM values against a greater silicon area than that used in figure 2(a). This transformation shows the trade-off discussed above.


4 Conclusion 

the design process of a circuit. Our approach is essentially based on the Binary Decision Diagram (BDD) and the Cyclomatic Testability Measure (CTM). We have proposed an objective function to estimate circuit-testability cost.


Controlling State Explosion During
Automatic  Verification  of  Delay-Insensitive  and 
Delay-Constrained  VLSI  Systems 
Using  the  POM  Verifier1 

D.  Probst  and  L.  Jensen 
Department  of  Computer  Science 
Concordia  University 
1455  de  Maisonneuve  Blvd.  West 
Montreal,  Quebec  Canada  H3G  1M8 

Abstract- Delay-insensitive VLSI systems have a certain appeal on the ground due to difficulties with clocks; they are even more attractive in space. We answer the question, is it possible to control state explosion arising from various sources during automatic verification (model checking) of delay-insensitive systems? State explosion due to concurrency is handled by introducing a partial-order representation for systems, and defining system correctness as a simple relation between two partial orders on the same set of system events (a graph problem). State explosion due to nondeterminism (chiefly arbitration) is handled when the system to be verified has a clean, finite recurrence structure. Backwards branching is a further optimization. The heart of this approach is the ability, during model checking, to discover a compact finite presentation of the verified system without prior composition of system components. The fully-implemented POM verification system has polynomial space and time performance on traditional asynchronous-circuit benchmarks that are exponential in space and time for other verification systems. We also sketch the generalization of this approach to handle delay-constrained VLSI systems.

Keywords: delay-insensitive system, model checking, state explosion, partial-order representation, recurrence structure, state encoding, delay-constrained reactive system.


1 Introduction 

Delay-insensitive systems are motivated by difficulties with clock distribution and component composition in clocked systems [1, 2, 5, 9]. In a delay-insensitive system, modules may be interconnected to form systems in such a way that system correctness does not depend on delays in either modules or interconnection media. Gate-level implementations of modules whose specifications are delay-insensitive are often themselves quasi-delay-insensitive; essentially, the assumption of isochronic forks allows one gate to handshake on behalf of another. Most interesting are delay-constrained reactive systems, in which either outputs or inputs or both must appear in some temporal window relative to enabling inputs or outputs. Hardware systems in space make delay insensitivity even more attractive due to (i) pervasive asynchronous communication, and (ii) extremely-low-power applications. Delay insensitivity has a natural link to controlling state explosion during automatic verification; the simple enabling relations in delay-insensitive control systems make it easy to discover a solution to the state-explosion problem based on causality checking. To build an automatic verifier based on causality checking, you need two things: (i) an expressive finite partial-order representation strategy that explicitly distinguishes concurrency, choice and recurrence, and (ii) a "goal-directed" state-encoding strategy that is both comprehensive (includes all causality) and minimal (has fewest states), the last for performance reasons. Given these two things, you can combine the best features of automata-based and partial-order-based computational verification methods.

1 This research was supported by the Natural Sciences and Engineering Research Council of Canada under grants A3363 and MEF0040121. Email: probst@crim.ca.


2 Behavior  Automata 

The basic automata used to represent processes are called behavior automata, which can be unrolled to produce event structures (essentially sets of partially-ordered computations with all branching due to conflict resolution made explicit) [5-8]. Partial orders and concurrent computation are discussed in [3]. Restrictions on behavior automata trade off between expressiveness and processability (e.g., the efficiency of verification algorithms) [8]. The most important rules for delay insensitivity are (cf. [10]):

Rule  1 Any  two  events  at  the  same  port  in  a partially-ordered  computation  are  order- 
separated  by  at  least  one  event  at  some  other  port. 

Rule  2 There  is  no  immediate  order  relation  between  two  input  events  or  two  output 
events.  Each  ordering  chain  is  an  infinite  sequence  of  strictly  alternating  input  and 
output  events. 

We  seek  abstract,  i.e.,  black-box,  specifications  [4].  For  this  purpose,  behavior  automata 
are  constructed  in  three  phases.  First,  there  is  a deterministic  finite-state  machine  (stick 
figure)  that  expresses  both  conflict  resolution  (choice)  and  recurrence  structure.  This  is  a 
“small”  automaton  relative  to  the  full  transition  system.  Second,  there  is  an  expansion  of 
dfsm  transitions  (sticks)  into  finite  posets,  with  additional  machinery  (sockets)  to  define 
possibly  nonsequential  concatenation  of  posets.  Third,  there  is  an  iterative  process  of 
labeling  successor  arrows  in  posets,  which  terminates  with  an  appropriate  state  encoding. 

We sketch the formal definition of a behavior automaton. Given disjoint alphabets Act (process actions), Arr (successor-arrow labels), Com (dfsm transitions) and Soc (sockets), first define Pos as the set of finite labeled posets over Act $\cup$ Soc. Each member of Pos is a labeled poset $(B, \Gamma, \nu)$, where (i) $\Gamma$ is a partial order over $B \subseteq$ Act $\cup$ Soc, and (ii) $\nu: \Omega \to$ Arr assigns a label to each element in the successor relation $\Omega$ (the transitive reduction of $\Gamma$). A behavior automaton is a 3-tuple $(D, \eta, o)$, where (i) D is a dfsm over Com, (ii) $\eta$: Com $\to$ Pos maps dfsm transitions to labeled posets, and (iii) o: Soc $\to$ powerset(Act) maps sockets to sets of process actions. Map o defines which process actions can "plug in" to an empty socket when a poset command is concatenated to a sequence of earlier poset commands as defined by dfsm D.

A C-element has two input ports a and b, and an output port c. Two actions are possible at a given port depending on whether the signal transition is rising (+) or falling (-). There is no conflict resolution (choice), and the recurrence structure of D is a simple loop. Transitions (sticks) concatenate sequentially in this example, shown in Fig. 1. Both the reset action and action c- can fill the unique socket in this poset. Digit colons identify dfsm D vertices.


Figure  1:  Behavior  automaton  for  a C-element. 


In  the  absence  of  conflict  resolution,  each  enabled  output  action  must  be  performed 
eventually  (indicated  by  bracketing).  The  use  of  both  dashed  and  solid  arrows  is  a visual 
reminder  that  a process  specification  contains  both  an  interprocess  protocol  (given  by 
the  dashed  arrows)  and  an  intraprocess  protocol  (given  by  the  solid  arrows).  Here,  the 
state  encoding  (arrow  labeling)  is  essentially  fixed;  since  the  state  is  encoded  as  the  set 
of  successor  arrows  crossing  from  the  past  to  the  future,  i.e.,  crossing  a consistent  cut 
produced  by  a partial  execution,  using  fewer  arrow  labels  would  alter  the  enabling  relation 
of  the  C-element. 

The semantics are straightforward. For example, action a+ is enabled in any state containing arrow 1; when it is performed, arrow 1 is removed from the state and arrow 3 is added. Similarly, action c+ is enabled and required (because of the bracket) in any state containing arrows 3 and 4. When it is performed, arrows 3 and 4 are removed from the state and arrows 5 and 6 are added. Action c- has preset and postset given by {7, 8} c- {1, 2}.
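The enabling rule just described (an action is enabled when its preset arrows are in the state; performing it swaps preset for postset) can be sketched as a small token game. Only the entries for a+, c+ and c- come from the text. The entries for b+, a- and b- and the initial state {1, 2} are assumptions filled in by symmetry, since Fig. 1 is not reproduced here, and the function names are hypothetical.

```python
# action -> (preset, postset) over successor-arrow labels
C_ELEMENT = {
    "a+": ({1}, {3}),         # from the text
    "c+": ({3, 4}, {5, 6}),   # from the text
    "c-": ({7, 8}, {1, 2}),   # from the text
    "b+": ({2}, {4}),         # assumed by symmetry
    "a-": ({5}, {7}),         # assumed by symmetry
    "b-": ({6}, {8}),         # assumed by symmetry
}


def enabled(state):
    """Actions whose whole preset is contained in the state."""
    return sorted(act for act, (pre, _) in C_ELEMENT.items() if pre <= state)


def perform(state, action):
    """Remove the preset arrows from the state and add the postset arrows."""
    pre, post = C_ELEMENT[action]
    if not pre <= state:
        raise ValueError(f"{action} is not enabled in state {sorted(state)}")
    return (state - pre) | post


state = frozenset({1, 2})   # assumed state after reset
print(enabled(state))
for action in ["a+", "b+", "c+", "a-", "b-", "c-"]:
    state = perform(state, action)
print(sorted(state))
```

In this sketch c+ becomes enabled only after both a+ and b+ have been performed, which is the defining behavior of a C-element, and one full cycle returns to the starting state.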

Behavior automata are more interesting when branching is involved. A delay-insensitive arbiter has two input ports a and b, and two output ports c and d. It grants exclusive access to one of two competing clients at a time. The behavior automaton is shown in Fig. 2.

Clients follow a four-cycle protocol. (A) = [c+] → a- and (B) = [d+] → b- are the two critical sections. The labeling shown, if completed, would be conservative (the state encoding includes all causality, but is not minimal). Having arrows 8, 9 and 10 in state encodings indicates who made the token available (viz., first client, second client and reset action). These three arrows are distinct instances of causality that must be checked separately. Still, there are too many state encodings.

We can group arrows 8, 9 and 10 into an equivalence class t. This does not alter the enabling relation. Consider performing action c+ in state {1, 5, t}. Causality checking of arrow t requires backing up in the behavior automaton to both possible sources, viz., actions a- and b-. In state {1, 5, t}, c+ and d+ are concurrently enabled but conflicting actions. Verification algorithms that process behavior automata perform both forwards branching (conflict resolution) and backwards branching (examination of distinct recent pasts).

Figure 2: Behavior automaton for a delay-insensitive arbiter.

After equivalenced arrow t has been defined, we can complete the picture in Fig. 2 to make it match the formal definition (the labeled arrows leaving posets are derivable from map o). Consider the second poset command. The top socket is filled only by action a+; its arrow is labeled 1. The middle socket is filled by any of the actions a-, b- and reset; its arrow is labeled t. The remaining (interior) poset arrows are given arbitrary distinct labels.

3 Correctness  as  a Graph  Problem 

We define correctness by using the mirror mP of specification P as a conceptual implementation tester [1]. We form an imaginary closed system S by linking mirror mP of specification P to the implementation network of processes Net. This produces an infinite pomtree (event structure) of system events on which two partial orders are defined; system correctness is then expressible as a simple, easily-checked relation between the partial orders. The standard model-independent notion of correctness is as follows. Is there a failure somewhere, causing system S to become undefined? Does the system just stop, violating fundamental liveness? Is some progress requirement of P violated? Is there (program-detectable) nondeterminate livelock in S so that an appeal to fairness of system components is necessary to assert progress? Is some conflict corresponding to output choice in P resolved unfairly?

Mirror  mP  is  formed  by  inverting  the  type  of  P’s  actions  and  the  causal/noncausal 
interpretation  of  P’s  successor  arrows,  turning  P’s  dashed  arrows  into  solid  arrows  and 
vice  versa.  Brackets  are  preserved  unchanged.  Every  action  that  can  be  performed  in  S is 
a linked  (output  action,  input  action)  pair.  As  a result,  we  can  check  whether  intraprocess 
protocols  support  interprocess  protocols  in  closed  system  S. 

We bootstrap the dashed (noncausal, interprocess protocol) and solid (causal, intraprocess protocol) relations from process actions to system actions, defining an event structure (sometimes called pomtree) with a noncausal enabling relation on top of the usual causal enabling one. For example, a noncausal predecessor of system action σ is found by locating the embedded process input action, stepping back along a dashed process arrow, and returning to the system alphabet. We have thus defined the "noncausal preset" of a system action. Essentially, the safety correctness relation is: whenever a dashed arrow links two system actions, a chain of solid arrows must also link the two actions.

Let σ be a system action that is causally enabled in S. There is a safety violation at σ unless

(a) its noncausal preset is also causally enabled in S, and

(b) each member of its noncausal preset is a causal ancestor of σ.

The causal preset of σ is defined only when σ is a bracketed system action: it is the set of nearest performances of linked mP output actions on any causal chain coming into σ. In order that a bracketed σ in S is neither a safety nor a progress violation, it is necessary that the causal and noncausal presets of σ match exactly. When backwards branching is present in S, these conditions are generalized to hold along each distinct past (backwards branch). Backwards branching is necessary to resolve multiple sources of equivalenced arrows.
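Condition (b) above is a reachability question on the solid-arrow graph. The following is a minimal sketch under simplifying assumptions: events are plain strings, the dashed and solid predecessor maps are given explicitly for a finite fragment of the computation, and neither condition (a) nor backwards branching is modeled. All names are hypothetical and this is not the POM verifier's actual data structure.

```python
def causal_ancestors(event, solid_preds):
    """All events reachable by chaining back across solid (causal) arrows."""
    seen = set()
    stack = list(solid_preds.get(event, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(solid_preds.get(current, ()))
    return seen


def violates_condition_b(event, dashed_preds, solid_preds):
    """True if some member of the noncausal preset is not a causal ancestor."""
    noncausal_preset = set(dashed_preds.get(event, ()))
    return not noncausal_preset <= causal_ancestors(event, solid_preds)


# Hypothetical fragment: x -> y -> z by solid arrows, x -> z by a dashed arrow.
solid = {"z": ["y"], "y": ["x"]}
dashed_ok = {"z": ["x"]}
dashed_bad = {"z": ["w"]}   # w has no solid chain to z

print(violates_condition_b("z", dashed_ok, solid))
print(violates_condition_b("z", dashed_bad, solid))
```

The first call reports no violation because the dashed arrow from x is backed by the solid chain x, y, z; the second reports one because no solid chain links w to z.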


4 Model  Checking 

The  algorithm  is  straightforward.  Starting  from  system  reset,  we  enumerate  causally- 
enabled  system  actions  and  visit  one  system  cut  per  action.  We  consider  each  enabled 
action  in  a state  produced  by  some  partially-ordered  past  that  we  have  generated.  First, 
we  repeatedly  step  back  across  single  dashed  arrows  to  compute  the  action’s  noncausal 
preset.  Second,  we  repeatedly  (finitely)  chain  back  across  multiple  solid  arrows  to  compute 
the  action’s  partial  causal  ancestor  set  (or  causal  preset  if  the  action  is  bracketed).  When 
equivalenced  arrows  are  encountered,  we  branch  backwards  to  check  each  possible  source. 
The  speedup  is  due  to  two  effects: 

1. we effectively check cuts in the generated past that we have passed by without visiting, and

2.  for  equivalenced  arrows,  we  effectively  check  cuts  in  pasts  that  we  have  not  generated. 

This  kills  state  explosion  due  to  concurrency  and/or  nondeterminism.  We  traverse 
each  determinate  segment  (stick)  of  the  implicitly  constructed  system  behavior  automaton 
(stick  figure)  precisely  once.  Backwards  branching  catches  all  causality  that  would  have 
been  visible  had  we  traversed  the  system  stick  figure  in  some  other  way.  Example  system 
stick  figures  are  shown  in  Fig.  3. 

We keep the termination table small by making the mapping from P states to S states one-to-few rather than one-to-many. This is possible when all behavior automata have visible branching and recurrence structure. Explicit structure in each component allows the verification algorithm to uncover a structure in system S. In particular, when we cycle in P, we can arrange to cycle in S. As a result, termination is achieved by checkpointing very few global states of system S. The top level of the algorithm visits system actions and tries to complete P sticks. The lower level of the algorithm does arrow checking.

Figure 3: System stick figures for the n-DME verification problem.


5 Output-delay-constrained  reactive  systems 

To fix ideas, consider a hardware system that is a space-based component of a missile defense system; this component receives massive amounts of target-acquisition data asynchronously, and is required to process it in real time and communicate the result. There are two types of delay constraint that could appear in a requirements specification of such a component, which is a typical reactive system. First, there could be a temporal interval, relative to the arrival of a complete problem instance, during which the component must respond; this is an output delay constraint. Second, there could be a temporal interval, relative to the departure of the previous result and/or the arrival of other input, during which the external world can safely stimulate the component; this is an input delay constraint. The simplest delay-constrained reactive systems are those in which delay constraints are imposed only on the intraprocess protocol, i.e., on module response; in this case, the mechanism that ensures input safety is unchanged (the interprocess protocol is still real or virtual handshaking). The difficult case is an interprocess protocol that specifies when the module can be overwhelmed by high-bandwidth input; we leave the difficult case for future work. In our representation, minimum/maximum-delay information is expressed by putting timing windows directly on output actions. Minimum-delay information may be freely entered on successor arrows, but maximum-delay semantics is constrained by questions of physical realizability. We choose the following uniform semantics. If bracketed output action c is annotated with the temporal interval (tmin, tmax), then action c will be performed no earlier than tmin units and no later than tmax units after the holding of its preset pre(c).

The standard verification algorithm for precedence constraints (described in section 4) can easily be extended to check these new delay constraints. When checking for a (precedence) safety violation at system action σ, we determine whether there is a causal chain to σ from each member of σ's noncausal preset, say, pre(σ). First, copy the timing window on each output action to each of its predecessor arrows. Second, find the sums of tmin and tmax along all causal chains to σ from each member of its noncausal preset pre(σ). Consider the maximum delay case. For τ ∈ pre(σ), define D(τ, σ) as the maximum sum of tmax values along any causal chain from τ to σ. Then system action σ will be performed no later than max over τ of D(τ, σ) units after the holding of its noncausal preset pre(σ). For the minimum delay case, define d(τ, σ) as the maximum sum of tmin values, and take the min over τ of d(τ, σ); σ will be performed no earlier than this many units after the holding of its noncausal preset.
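The two bounds just defined reduce to longest-path sums on the causal (solid-arrow) graph. The sketch below assumes a finite acyclic fragment in which the timing windows have already been copied onto the arrows and every member of the noncausal preset has at least one causal chain to σ (that is, the precedence check has passed). The graph, weights and names are hypothetical.

```python
from functools import lru_cache


def max_chain_sum(succ, weight, source, target):
    """Largest sum of arrow weights over all causal chains source -> target.

    Returns None when no chain exists. Assumes the graph is acyclic.
    """
    @lru_cache(maxsize=None)
    def best(node):
        if node == target:
            return 0
        totals = []
        for nxt in succ.get(node, ()):
            sub = best(nxt)
            if sub is not None:
                totals.append(weight[(node, nxt)] + sub)
        return max(totals) if totals else None

    return best(source)


def delay_bounds(succ, tmin, tmax, preset, sigma):
    """(earliest, latest) delay of sigma after the holding of its noncausal preset."""
    latest = max(max_chain_sum(succ, tmax, tau, sigma) for tau in preset)
    earliest = min(max_chain_sum(succ, tmin, tau, sigma) for tau in preset)
    return earliest, latest


# Hypothetical fragment: t1 -> m -> s, t1 -> s, t2 -> s.
# Arrows into m carry window (1, 2); arrows into s carry window (2, 5).
succ = {"t1": ["m", "s"], "m": ["s"], "t2": ["s"]}
tmin = {("t1", "m"): 1, ("m", "s"): 2, ("t1", "s"): 2, ("t2", "s"): 2}
tmax = {("t1", "m"): 2, ("m", "s"): 5, ("t1", "s"): 5, ("t2", "s"): 5}

print(delay_bounds(succ, tmin, tmax, ["t1", "t2"], "s"))
```

Memoizing the per-node result keeps the traversal linear in the number of arrows of the fragment, which fits the claim that checking temporal windows costs little more than checking precedence.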


6 Conclusion 

A complete verification package has been written by Lin Jensen in the Trilogy programming language running on an IBM PC. The POM system has polynomial space and time performance on benchmarks that are exponential in space and time for other verification systems. Consider the ring of DME elements benchmark. The runtime for verification of both safety and progress properties is quadratic in n, the number of DME elements. The number of system states grows exponentially with n. For example, when n = 9, the time is 180 s (roughly $10^9$ states); when n = 10, the time is 220 s (roughly $10^{10}$ states). The space requirements for these problems do not exceed 64K bytes, i.e., one IBM PC data segment. What are the compiler-independent space requirements? One must store the input; this is linear. One must store the termination table; this is quadratic. Given reasonable garbage collection, the working storage to do backwards chaining in a partially-ordered system computation is linear, because one constructs and compares simple presets. The limiting resource is the quadratic space used to store the termination table. To repeat, both space and time are quadratic, in this example, to verify a concurrent system with exponentially many states. Building up the actual partially-ordered system computations themselves is unnecessary; we work directly with the uncomposed behavior automata of the system components. We have also shown, at least in the simple case of output-delay-constrained reactive systems, that verifying temporal window constraints is barely more expensive than verifying precedence constraints. In general, the achievable efficiency of a real-time verification algorithm is a sensitive function of the precise abstraction of real time used in the model.


Formal Verification of an MMU and MMU Cache

E.  T.  Schubert 
Division  of  Computer  Science 
University  of  California,  Davis 

Abstract - We describe the formal verification of a hardware subsystem consisting of a memory management unit and a cache. These devices are verified independently and then shown to interact correctly when composed. The MMU authorizes memory requests and translates virtual addresses to real addresses. The cache improves performance by maintaining an LRU list from the memory resident segment table.

1 Introduction 

Computers are being used in areas where no affordable level of testing is adequate. Safety and life critical systems must find a replacement for exhaustive testing to guarantee their correctness. Through a mathematical proof, hardware verification can formally demonstrate that a design satisfies its specification. However, hardware verification research has focused on device verification and has largely ignored system composition verification [1]. Our research is directed towards developing a methodology to verify a hardware base for a safety critical system. The top level hardware specification is apt to suggest a unitary implementation. This abstraction is convenient for verifying the correctness of software; however, the implementation consists of many different interacting components (CPU, memory, coprocessors, I/O devices, bus controllers, interrupt controllers, etc). This paper will describe our efforts to verify a subsystem consisting of an MMU and its cache using the HOL theorem prover [2].

The  abstract  MMU  reported  in  [3]  assumed  a memory  model  where  a read  request  was 
satisfied  in  one  cycle.  We  extend  the  MMU  to  interact  with  an  asynchronous  memory. 
Additionally, the memory is more fully described, providing read and write functions.
These  changes  required  several  significant  changes  to  the  abstract  MMU  proof  script.  The 
original  proof  strategy  took  advantage  of  the  single  cycle  response  time.  The  new  strategy 
must  use  two  arbitrary  contents  to  define  when  memory  words  are  returned  from  the 
memory-cache  subsystem. 

1.1  Related  Work 

Hardware  verification  requires  that  the  design  of  a system  is  formally  shown  to  satisfy 
its  specification  through  a mathematical  proof.  Using  theorem  proving  techniques,  an 
expression  describing  the  behavior  of  a device  is  proven  to  be  equivalent  in  some  sense 
to  an  expression  describing  the  implementation  structure  of  the  device.  These  expressions 
concisely  describe  the  behavior  of  devices  in  an  unambiguous  way.  An  additional  benefit  of 
hardware  verification  is  that  the  behavioral  semantics  of  the  hardware  are  clearly  defined. 
This  provides  an  accurate  basis  for  building  correct  software  systems  [5]. 




Hardware verification efforts thus far have focused primarily on a microprocessor as
the base for computer systems [6], [7], [8], [9]. The processors verified have modeled small
instruction sets and generally have not included modern CPU features such as pipelines,
multiple functional units and hardware interrupt support. Tamarack-3 [9] and AVM-1 [10]
do provide sufficient interrupt support to connect with an interrupt controller. However, no
system currently verified provides the memory management functions necessary to support
a secure operating system.

Previous efforts to verify systems have constructed vertically verified systems with a
microprocessor/memory as the system’s base [11], [5], [1]. These efforts have aimed at
illustrating how hardware verification can be used to close the semantic gap between high
level languages and the computer’s instruction set. However, the base for these systems (a
microprocessor-memory pair) has been an unrealistic hardware platform.

1.2  HOL 

The  object  language  of  HOL  is  a formulation  of  higher-order  logic.  Universally  quantified 
variables  are  used  to  specify  input  and  output  device  lines  while  internal  device  lines  are 
existentially quantified. Conditional expressions are in the form: cond → then-clause |
else-clause.

HOL  provides  the  human  verifier  with  a selection  of  tactics  for  use  in  goal-directed 
proofs.  The  tactics  are  very  similar  to  the  kinds  of  steps  a human  theorem  prover  would 
take  in  solving  a goal.  New  tactics  can  be  written  that  allow  the  theorem  prover  to  be 
extended  and  customized  for  a particular  task.  New  theorems  can  only  be  created  in 
a controlled manner. All proofs can be reduced to one containing only the 8 primitive
inference rules and 5 primitive axioms. High-level inference rules and tactics are derived
from some combination of primitive inference rules.

The following HOL expression defines an and gate implementation using an inverter
and a nand gate. The existentially quantified variable p represents an internal line which
links the output of the nand gate with the input of the inverter.

    ⊢def andGate a b out = ∃p. nand a b p ∧ inv p out
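The role of the existentially quantified internal line can be illustrated with a small executable sketch. This is not HOL and not the author's code: it assumes each gate is modelled as a Boolean relation over its lines, and the names `and_gate_imp` and `and_gate_spec` are chosen here for illustration. The check enumerates every assignment to the external lines and confirms that the structural description agrees with the intended behavior.

```python
from itertools import product

BOOLS = (False, True)

def nand(a, b, out):
    """Relational model of a nand gate: holds when out = not (a and b)."""
    return out == (not (a and b))

def inv(i, out):
    """Relational model of an inverter: holds when out = not i."""
    return out == (not i)

def and_gate_imp(a, b, out):
    """Structure: there exists an internal line p joining the two gates."""
    return any(nand(a, b, p) and inv(p, out) for p in BOOLS)

def and_gate_spec(a, b, out):
    """Behavior: the output is the conjunction of the inputs."""
    return out == (a and b)

def implementation_matches_spec():
    """Compare structure and behavior on all eight line assignments."""
    return all(and_gate_imp(a, b, out) == and_gate_spec(a, b, out)
               for a, b, out in product(BOOLS, repeat=3))
```

In HOL the same equivalence would be established by proof rather than by enumeration; the enumeration is only feasible here because the device has three one-bit lines.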


2 Memory  Management  Unit 

[12]  describes  a number  of  memory  management  units  which  form  a complexity  hierarchy. 
By  developing  a sophisticated  MMU  in  steps,  the  construction  of  the  final  proof  appears 
to  be  more  tractable.  The  simpler  devices  validate  access  to  fixed  length  memory  pages 
while  the  more  complex  devices  authorize  read,  write  or  execute  access  to  variable  length 
segments  and  translate  virtual  addresses  to  real  addresses.  Many  of  these  devices  were 
designed  and  verified  to  the  gate  level.  However,  as  the  complexity  increases,  the  emphasis 
of  the  verification  shifts  from  gate  level  connections  to  the  correctness  of  the  operating 
system support features.

The  device  described  below  validates  memory  requests  based  on  information  maintained 
in  a memory  resident  segment  descriptor  table.  The  location  of  the  table  is  determined  by  a 
segment table pointer register which is accessible only during supervisor operations. Each
descriptor  consists  of  two  words:  the  first  contains  access  control  information  (present  bit, 
read/write/execute  permissions,  segment  size)  and  the  second  serves  as  the  base  address 
for  the  segment’s  real  location  in  memory.  To  translate  from  a virtual  address  to  a real 
address,  the  MMU  adds  the  segment  offset  to  the  segment  base  address.  The  MMU  assumes 
the  table  provides  an  entry  for  all  possible  segment  descriptors. 

A generic theory for a class of MMU devices is defined where several functions and
data types are left abstract. Using an abstract representation, details such as word length
can be omitted and the verification focuses only on the correctness of higher level
abstraction (e.g. electronic block level rather than gate level). At a later point, the abstract
representation can be instantiated with components that implement concrete behavior.

Support  for  generic  or  abstract  theories  is  not  directly  provided  by  HOL.  However, 
a theory  about  abstract  representations  can  be  defined  in  the  object  language  [10].  An 
abstract  representation  contains  a set  of  uninterpreted  constants,  types,  abstract  operations 
and  a set  of  abstract  objects.  The  semantics  of  the  abstract  representation  is  unspecified. 
Inside  the  theory,  we  do  not  know  what  the  objects  and  operations  mean.  The  abstract 
theory  package  also  creates  a set  of  selector  functions  [11]  to  extract  desired  functions  from 
an  abstract  representation. 

The abstract MMU representation generalizes traits particular to concrete
implementations. Properties such as the exact security policy and division of a virtual address
into a segment identifier and offset (as well as the overall number of bits in an address)
are hidden by functions which, given an address, return the segment identifier or segment
offset field (segId and segOfs, respectively). There is also a function segIdshf which
returns the offset of a segment descriptor within the memory resident segment table for
a given address. Since descriptors require two words, the implementation of this function
simply shifts the segment identifier to the left one bit position (i.e. it adds a trailing zero
bit).
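One possible concrete instantiation of these field functions is sketched below. The field widths (a 4-bit segment identifier above a 12-bit offset) and the helper `descriptor_addresses` are assumptions made for illustration; the abstract theory deliberately leaves them unspecified.

```python
OFFSET_BITS = 12  # assumed: the low 12 bits hold the segment offset
SEG_BITS = 4      # assumed: the next 4 bits hold the segment identifier

def seg_id(vaddr):
    """Segment identifier field of a virtual address."""
    return (vaddr >> OFFSET_BITS) & ((1 << SEG_BITS) - 1)

def seg_ofs(vaddr):
    """Segment offset field of a virtual address."""
    return vaddr & ((1 << OFFSET_BITS) - 1)

def seg_id_shf(vaddr):
    """Offset of the descriptor within the segment table.

    Each descriptor occupies two words, so the identifier is shifted
    left by one bit position (a trailing zero bit is appended).
    """
    return seg_id(vaddr) << 1

def descriptor_addresses(vaddr, tbl_ptr):
    """Hypothetical helper: addresses of the two descriptor words."""
    first = tbl_ptr + seg_id_shf(vaddr)
    return first, first + 1
```

For the address 0x3ABC and a table at 0x100, the access control word would sit at 0x106 and the base address word at 0x107 under these assumed widths.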

The  abstract  functions  selected  by  availBit,  readBit,  writeBit  and  execBit  extract 
a bit  value  from  an  argument  of  type  *wordn.  These  functions  are  applied  to  the  first  word 
of  a segment  descriptor. 

Several functions which operate on two-tuples are available. Given a pair of *wordn
values, add returns a value of *wordn. Functions addrEq, ofsLEq and validAccess replace
the bit vector comparison units defined for the more concrete units.

Additional abstract coercion functions are available to convert values between types. If
the theory were instantiated, the abstract types would likely be implemented with bit
vectors, leaving these functions unnecessary.

Memory  is  also  treated  abstractly.  The  abstract  representation  provides  a fetch  function 
fetch. 

    let mmu_abs = new_abstract_representation
      [ (`segId`,       ":(*address -> *wordn)");
        (`segOfs`,      ":(*address -> *wordn)");
        (`segIdshf`,    ":(*address -> *wordn)");
        (`availBit`,    ":(*wordn -> bool)");
        (`readBit`,     ":(*wordn -> bool)");
        (`writeBit`,    ":(*wordn -> bool)");
        (`execBit`,     ":(*wordn -> bool)");
        (`add`,         ":(*wordn # *wordn -> *wordn)");
        (`addrEq`,      ":(*address # *address -> bool)");
        (`ofsLEq`,      ":(*address # *wordn -> bool)");
        (`validAccess`, ":(*address # *wordn # RWE -> bool)");
        (`val`,         ":(*wordn -> num)");
        (`wordn`,       ":(num -> *wordn)");
        (`address`,     ":(*wordn -> *address)");
        (`fetch`,       ":(*memory # *address) -> *wordn") ]


A type  abbreviation  RWE  is  also  defined  to  be  a three  tuple  of  bit  values.  Selector 
functions  rBIT,  wBIT  and  eBIT  access  the  first,  second  and  third  bits,  respectively. 


2.1  Specification 


The  specification  is  decomposed  into  several  rules  and  ignores  timing  characteristics.  The 
state  and  output  environment  of  the  MMU  specification  is  a three-tuple  consisting  of  a 
boolean  acknowledgment,  a memory  address  and  the  table  pointer  register  value.  The 
variable  r in  the  definitions  below  is  the  abstract  representation. 

Functions superMode and userMode describe the behavior of the MMU when operating
in their respective modes. legalAccess uses many of the abstract functions to fetch
from memory the appropriate segment descriptor and compare it with the request’s access
parameters. vToR constructs a real address from a virtual address.


    ⊢def legalAccess r vAddr tblPtr rwe mem =
           let a = (fetch r)(mem, (address r)((add r)(segIdshf r vAddr, tblPtr))) in
           ((validAccess r)(vAddr, a, rwe) ∧ (ofsLEq r)(vAddr, a))

    ⊢def vToR r vAddr tblPtr mem =
           let a = (fetch r)(mem, (address r)((add r)((wordn r 1),
                                   (add r)(segIdshf r vAddr, tblPtr)))) in
           (address r)((add r)(segOfs r vAddr, a))

    ⊢def superMode r vAddr rwe tblPtrADDR tblPtr data mem =
           ((wBIT rwe) ∧ (addrEq r (vAddr, tblPtrADDR))) →
               (T, vAddr, data) | (T, vAddr, tblPtr)

    ⊢def userMode r vAddr rwe tblPtrADDR tblPtr data mem =
           legalAccess r vAddr tblPtr rwe mem →
               (T, (vToR r vAddr tblPtr mem), tblPtr) | (F, vAddr, tblPtr)

    ⊢def mmu_spec r vAddr rwe tblPtrADDR tblPtr data mem superv =
           superv → superMode r vAddr rwe tblPtrADDR tblPtr data mem
                  | userMode r vAddr rwe tblPtrADDR tblPtr data mem
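The specification rules can be exercised on concrete values with the following sketch. It is an illustration only, not the author's HOL text. The object `r` stands in for the abstract representation, and everything it contains is an assumption: words and addresses are plain integers, memory is a dictionary, and the first descriptor word is laid out with the present bit in bit 0, the read/write/execute bits in bits 1 to 3 and the segment size in the remaining bits.

```python
from types import SimpleNamespace

OFFSET_BITS = 12  # assumed width of the segment offset field

def seg_ofs(vaddr):
    return vaddr & ((1 << OFFSET_BITS) - 1)

def seg_id_shf(vaddr):
    return (vaddr >> OFFSET_BITS) << 1

def valid_access(vaddr, word, rwe):
    # Assumed descriptor layout: bit 0 present, bits 1..3 r/w/e, bits 4.. size.
    present = bool(word & 1)
    granted = (bool(word & 2), bool(word & 4), bool(word & 8))
    return present and all(g or not wanted for g, wanted in zip(granted, rwe))

def ofs_leq(vaddr, word):
    return seg_ofs(vaddr) <= (word >> 4)

# Stand-in for the abstract representation; every entry is an assumption.
r = SimpleNamespace(
    segIdshf=seg_id_shf, segOfs=seg_ofs,
    add=lambda x, y: x + y,
    wordn=lambda n: n, address=lambda w: w,
    fetch=lambda mem, addr: mem[addr],
    addrEq=lambda x, y: x == y,
    validAccess=valid_access, ofsLEq=ofs_leq,
)

def legal_access(r, vaddr, tbl_ptr, rwe, mem):
    a = r.fetch(mem, r.address(r.add(r.segIdshf(vaddr), tbl_ptr)))
    return r.validAccess(vaddr, a, rwe) and r.ofsLEq(vaddr, a)

def v_to_r(r, vaddr, tbl_ptr, mem):
    second = r.add(r.wordn(1), r.add(r.segIdshf(vaddr), tbl_ptr))
    base = r.fetch(mem, r.address(second))
    return r.address(r.add(r.segOfs(vaddr), base))

def super_mode(r, vaddr, rwe, tbl_ptr_addr, tbl_ptr, data, mem):
    write = rwe[1]  # wBIT selects the second component of rwe
    if write and r.addrEq(vaddr, tbl_ptr_addr):
        return (True, vaddr, data)
    return (True, vaddr, tbl_ptr)

def user_mode(r, vaddr, rwe, tbl_ptr_addr, tbl_ptr, data, mem):
    if legal_access(r, vaddr, tbl_ptr, rwe, mem):
        return (True, v_to_r(r, vaddr, tbl_ptr, mem), tbl_ptr)
    return (False, vaddr, tbl_ptr)

def mmu_spec(r, vaddr, rwe, tbl_ptr_addr, tbl_ptr, data, mem, superv):
    mode = super_mode if superv else user_mode
    return mode(r, vaddr, rwe, tbl_ptr_addr, tbl_ptr, data, mem)

# Segment 2: present, read-only, size 0xFF, based at 0x8000; table at 0x100.
mem = {0x104: 0x1 | 0x2 | (0xFF << 4), 0x105: 0x8000}
READ, WRITE = (True, False, False), (False, True, False)
```

Each call returns the three-tuple described in the text: the acknowledgment, the memory address and the table pointer register value. A user read of 0x2010 is acknowledged and translated to 0x8010, while a user write to the same address is refused and the address is passed through unchanged.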

2.2  Implementation

The  implementation  is  constructed  from  electronic  block  model  components.  These  are 
defined  as  specifications  for  the  behavior  of  a gate  level  implementation.  Many  of  the 
devices  specify  their  timing  behavior  as  well.  The  building  blocks  consist  of  a security 
comparison  unit,  an  address  match  unit,  a memory  fetch  unit,  an  adder,  registers,  latches, 
muxes, and a control unit. Most of the device definitions are straightforward with the
exception  of  the  memory  and  the  control  unit.  These  two  units  will  be  described  in  greater 
detail. 


    ⊢def secUnit_spec r a b rwe ok = ∀t. ok (t+1) =
           ((validAccess r)((a t),(b t),(rwe t)) ∧ (ofsLEq r)((a t),(b t)))

    ⊢def addUnit_spec r a b c = ∀t:num. c (t+1) = (add r ((a t),(b t)))

    ⊢def muxUnit_spec r a b out w =
           ∀t. (out(t+1)) = (w(t+1)) → address r (b(t+1)) | (a t)

    ⊢def mux3Unit_spec a b c out w = ∀t:num.
           (out t) = (w t = 0) → a t | (w t = 1) → b t | c t

    ⊢def splitUnit_spec r virt id ofs = ∀t:num.
           ((id t) = (segIdshf r)(virt t)) ∧ ((ofs t) = (segOfs r)(virt t))

    ⊢def latchUnit_spec r i out Ctrl = ∀t:num.
           out (t+1) = Ctrl (t+1) → out t | (i (t+1))

    ⊢def regUnit_spec r i ld clr out =
           (∀t:num. out(t+1) = (clr t → (wordn r 0) | ld t → i t | out t))
           ∧ (out 0 = (wordn r 0))

    ⊢def matchUnit_spec r a b m = ∀(t:num). m(t+1) = (addrEq r (a t, b t)) → T | F
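The timing behavior of the register and latch specifications can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions: signals are finite Python lists indexed by time, the zero word is the integer 0, and the latch takes an explicit initial output `out0` because the latch specification does not constrain the output at time 0.

```python
def reg_unit(i, ld, clr, zero=0):
    """Register: out(0) = zero; out(t+1) is zero when clr(t) holds,
    i(t) when ld(t) holds, and out(t) otherwise."""
    out = [zero]
    for t in range(len(i)):
        if clr[t]:
            out.append(zero)
        elif ld[t]:
            out.append(i[t])
        else:
            out.append(out[t])
    return out

def latch_unit(i, ctrl, out0):
    """Latch: out(t+1) holds out(t) when ctrl(t+1) is true and
    follows i(t+1) otherwise. out0 is an assumed initial value."""
    out = [out0]
    for t in range(len(i) - 1):
        out.append(out[t] if ctrl[t + 1] else i[t + 1])
    return out
```

The register responds one time unit after its control inputs, whereas the latch is transparent within the same time unit whenever its control line is false. That difference is why the two devices are specified separately.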


Memory  Unit 

As  a first  step  towards  composing  devices,  the  memory  specification  used  for  the  MMU 
verification  is  significantly  expanded  from  the  model  used  in  [3].  The  earlier  model  assumed 
a read-only  memory  that  returned  a value  one  clock  cycle  after  a request  was  made.  The 
new  model  defines  asynchronous  read  and  write  operations.  This  model  makes  an  implicit 
assumption  that  each  memory  request  is  satisfied  before  the  next  request  is  generated. 
Most  of  the  new  proof  effort  centered  on  establishing  the  correctness  of  the  MMU  control 
unit  with  the  new  memory  specification. 


    ⊢def memoryUnit_spec r req rwe addr data done mem =
           (done 0 = F) ∧
           (∀t. (req t) →
               (∃t'. Next done (t, t+t') ∧
                   (wBIT (rwe t) →
                       ((mem (t+t') = store r (mem t, addr t, data t))) |
                       ((data (t+t') = fetch r (mem t, addr t)) ∧
                        (mem (t+t') = mem t)))) |
               ((done (t+1) = F) ∧
                (mem (t+1) = mem t)))


Control  Unit 

To  process  each  memory  request,  the  control  unit  will  pass  through  several  clocked 
phases.  At  each  clock  tick  the  control  unit  may  change  its  phase  depending  on  the  results 
computed  by  the  other  internal  units  and  the  MMU  input  from  the  system  bus.  The  control 
unit state is maintained by the variable phase. There are six distinct phases; however,
not all phases are executed for each request. Which phases are executed depends on the
validity  of  the  memory  request.  Request  evaluation  begins  with  the  control  unit  in  phase 
0 and  completes  when  phase  0 is  again  reached.  A valid  request  will  require  five  phases 
with  a delay  of  at  least  one  time  unit  before  each  phase  change. 
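The phase sequencing can be sketched as a small state machine. The transition function below is one reading of the control unit specification and should be treated as an assumption: it tracks only the phase, not the control outputs, and it takes phase 5 to be the step in which a supervisor write to the table pointer register is completed. The helper `trace_request` and its `fetch_delay` parameter are hypothetical and exist only to drive the machine.

```python
def next_phase(phase, req_in, super_, write, match, fdone):
    """Next control unit phase, under the assumptions stated above."""
    if phase == 0:
        return 1 if req_in else 0
    if phase == 1:
        if super_:
            # Supervisor write to the table pointer address goes to phase 5.
            return 5 if (write and match) else 0
        return 2
    if phase == 2:
        return 3 if fdone else 2   # wait for the first descriptor word
    if phase == 3:
        return 4 if fdone else 3   # wait for the second descriptor word
    return 0                       # phases 4 and 5 return to phase 0

def trace_request(super_, write, match, fetch_delay):
    """Phases visited from a request until phase 0 is reached again.

    fetch_delay is the assumed number of extra time units the memory
    needs before it signals fdone in phases 2 and 3.
    """
    phase, visited, waited = 0, [0], 0
    req_in = True
    while True:
        fdone = waited >= fetch_delay
        nxt = next_phase(phase, req_in, super_, write, match, fdone)
        waited = waited + 1 if nxt == phase else 0
        phase = nxt
        req_in = False
        visited.append(phase)
        if phase == 0:
            return visited
```

Under this reading a valid user request visits phases 0, 1, 2, 3 and 4, which matches the five phases mentioned in the text, and a slower memory only lengthens the stay in phases 2 and 3.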


Figure 1: Abstract MMU Internal Block Diagram

The  datapath  definition  describes  the  interconnection  between  all  the  units  other  than 
the  control  unit. 


    ⊢def datapath r vAddr vData rwe mem tblPtrADDR tblPtr rAddr muxC
                  tmpC tblC lC rReq xlat match secOK fdone =
           ∃mux1 mux2 id ofs addOut data latOut secData.
             (regUnit_spec r vData tblC bitFalse tblPtr) ∧
             (regUnit_spec r data tmpC bitFalse secData) ∧
             (secUnit_spec r vAddr secData rwe secOK) ∧
             (splitUnit_spec r vAddr id ofs) ∧
             (mux3Unit_spec tblPtr data latOut mux2 muxC) ∧
             (addUnit_spec r mux1 mux2 addOut) ∧
             (latchUnit_spec r addOut latOut lC) ∧
             (matchUnit_spec r vAddr tblPtrADDR match) ∧
             (muxUnit_spec r vAddr latOut rAddr xlat) ∧
             (memoryUnit_spec r rReq rAddr data fdone mem)


The implementation definition connects the datapath with the control unit. The state
consists of the table pointer register value, the security data register and the control unit
phase (tblPtr, secData, phase). The input environment is provided by the system bus
and the memory (vAddr, vData, rwe, superv, reqIn, mem). The output environment
includes a real address and several control unit outputs (rAddr, done, ack, xlat). The
memory address of the table pointer register is specified by the constant tblPtrADDR.

    ⊢def controlUnit_spec reqIn super rwe match secOK fdone muxC tmpC tblC
                          lC rReq xlat done ack phase =
      ((muxC 0, tmpC 0, tblC 0, lC 0, rReq 0, xlat 0, done 0, ack 0, phase 0)
          = (0,F,F,F,F,F,F,F,0))
      ∧
      (∀t. (muxC(t+1), tmpC(t+1), tblC(t+1), lC(t+1), rReq(t+1), xlat(t+1),
            done(t+1), ack(t+1), phase(t+1)) =
        % column order: muxC, tmpC, tblC, lC, rReq, xlat, done, ack, phase %
        (phase t = 0) → (reqIn t →                      ( 0, F,F,F, F,F,F,F, 1) |
                                                         ( 0, F,F,F, F,F,F,F, 0)) |
        (phase t = 1) → (super t →
                          (((wBIT (rwe t)) ∧ match t) → ( 0, F,T,F, F,F,F,F, 5) |
                                                         ( 0, F,F,F, F,F,T,T, 0)) |
                                                         ( 2, T,F,T, T,T,F,F, 2)) |
        ((phase t = 2) ∧ fdone t) →                      ( 1, F,F,F, T,T,F,F, 3) |
        ((phase t = 3) ∧ fdone t) →                      ( 0, F,F,F, F,T,F,F, 4) |
        (phase t = 4) → (secOK t →                       ( 0, F,F,T, F,T,T,T, 0) |
                                                         ( 0, F,F,F, F,F,T,F, 0)) |
        (phase t = 5) →                                  ( 0, F,F,F, F,F,T,T, 0) |
        (muxC t, tmpC t, tblC t, lC t, F, xlat t, done t, ack t, phase t))

Correctness Statement

Several auxiliary definitions are used to express the final correctness statement. To
relate the implementation to the specification, a temporal abstraction is constructed using
the two predicates Next and First [9]. The predicate First is true when its argument t is
the first time that g is true. The predicate Next is true when t2 is the next time after t1
that g is true. The predicate stable_sigs states that between t1 and t2 the MMU inputs
will remain constant.

    ⊢def First g t = (∀p:time. p < t ⇒ ¬(g p)) ∧ (g t)

    ⊢def Next g (t1,t2) = (t1 < t2) ∧
           (∀t:time. t1 < t ∧ t < t2 ⇒ ¬(g t)) ∧ (g t2)

    ⊢def stable_sigs t1 t2 vAddr rwe tblPtrADDR data mem super =
           ∀t'. t1 < t' ∧ t' < t2 ⇒
             (super t' = super t1) ∧ (vAddr t' = vAddr t1) ∧ (rwe t' = rwe t1) ∧
             (data t' = data t1) ∧ (tblPtrADDR t' = tblPtrADDR t1) ∧ (mem t' = mem t1)
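The temporal predicates can be tried out on finite traces with the sketch below. It assumes that a signal is a Python function from a natural-number time to a value. The single-signal predicate `stable` is a simplification introduced here: stable_sigs is the conjunction of this property over all of the MMU inputs.

```python
def first(g, t):
    """t is the first time at which g is true."""
    return all(not g(p) for p in range(t)) and g(t)

def next_time(g, t1, t2):
    """t2 is the next time after t1 at which g is true."""
    return t1 < t2 and all(not g(t) for t in range(t1 + 1, t2)) and g(t2)

def stable(signal, t1, t2):
    """The signal keeps its t1 value strictly between t1 and t2."""
    return all(signal(t) == signal(t1) for t in range(t1 + 1, t2))

# Example traces (hypothetical): done pulses at times 3 and 7,
# and the address input changes at time 5.
done = lambda t: t in (3, 7)
v_addr = lambda t: 0x2010 if t < 5 else 0x0000
```

With these traces, `next_time(done, 3, 7)` holds while `next_time(done, 0, 7)` does not, because done is already true at time 3. This is the sense in which Next picks out the response to a particular request.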

The correctness theorem states that if the implementation is in phase 0 and a memory
request is made, the implementation will eventually respond (c time steps later), when
the  state  of  the  implementation  matches  the  state  defined  by  the  specification  for  a set  of 
given  MMU  inputs.  The  inputs  must  remain  stable  until  the  MMU  responds  to  a request. 
If  a memory  request  is  not  made,  the  acknowledgment  line  remains  F,  the  phase  remains 
0 and  the  MMU  table  pointer  register  remains  unchanged. 


    ⊢ mmu_imp r vAddr vData rwe super tblPtr tblPtrADDR
              reqIn rAddr done ack xlat mem phase ⇒
        (∀t. (phase t = 0) ⇒
           (reqIn t →
              (∃c. Next done (t, t+c) ∧ (phase (t+c) = 0) ∧
                   (stable_sigs t (t+c) vAddr rwe tblPtrADDR
                                vData mem super ⇒
                      (mmu_spec r (vAddr t) (rwe t) (tblPtrADDR t)
                                (tblPtr t) (vData t) (mem t) (super t)
                         = ack (t+c), rAddr (t+c), tblPtr (t+c))))
            | ((ack (t+1) = F) ∧
               (phase (t+1) = 0) ∧
               (tblPtr (t+1) = tblPtr t))))

3 Memory  Subsystem

An  initial  design  integrated  a FIFO  cache  stack  inside  the  MMU  but  here  we  model  a fully 
associative  cache  as  part  of  the  memory  subsystem.  The  cache  is  described  as  a lookup 
table  and  implements  a least  recently  used  (LRU)  replacement  strategy.  Each  table  entry 
consists  of  a key,  a related  data  word,  and  a boolean  indicating  whether  the  entry  is  active. 
We  will  first  describe  the  specification  of  the  LRU  replacement  strategy  in  HOL,  followed 
by  the  cache  implementation. 

    TAB_ENTRY = ":bool # *address # *wordn" : type
    TAB = ":(*TAB_ENTRY)list" : type

    ⊢def live entry = (FST entry)

    ⊢def key entry = (FST (SND entry))

    ⊢def content entry = (SND (SND entry))


Several auxiliary (recursive) definitions describe table operations below. When an entry
is inserted into the top of the table, the entry at the bottom will be lost only when the table
is “full” (all entries are live). In this respect, the table acts as a queue.


    ⊢def (TAB_FULL tbl 0 = live (EL 0 tbl)) ∧
         (TAB_FULL tbl (SUC n) = (live (EL (SUC n) tbl) ∧ TAB_FULL tbl n))

    ⊢def (TAB_INSERT tbl entry 0 = [entry]) ∧
         (TAB_INSERT tbl entry (SUC n) = (APPEND (TAB_INSERT tbl entry n)
             ((TAB_FULL tbl n) → [(EL n tbl)] | [(EL (SUC n) tbl)])))


A table lookup is successful if there is a key match for one of the entries. For a table
size of n, TAB_HIT returns (SUC n) if the lookup fails.


    ⊢def KEY_MATCH rep tbl sg:*address n =
           (live (EL n tbl) ∧ ((addrEq rep)(key (EL n tbl), sg)))

    ⊢def (TAB_HIT rep tbl sg m 0 =
           ((KEY_MATCH rep tbl sg 0) → 0 | (SUC m))) ∧
         (TAB_HIT rep tbl sg m (SUC n) =
           ((KEY_MATCH rep tbl sg (SUC n)) → (SUC n) | TAB_HIT rep tbl sg m n))


Frequently,  a single  matched  entry  must  be  invalidated.  This  can  occur  due  to  the  LRU 
policy  or  a memory  write  operation.  Occasionally,  the  entire  cache  must  be  invalidated  at 
the  request  of  the  operating  system.  The  LRU  policy  requires  that  if  a key  match  occurs, 
the  entry  be  inserted  at  the  top  of  the  table.  By  invalidating  the  matched  entry  before 
the insertion, a table overflow will not occur. LRU_LOOKUP returns the requested data value
and the updated cache table.

    ⊢def ENTRY_INVALIDATE entry = (F, key entry, content entry)

    ⊢def (TAB_INVALIDATE tbl 0 = [(ENTRY_INVALIDATE (EL 0 tbl))]) ∧
         (TAB_INVALIDATE tbl (SUC n) =
           (APPEND (TAB_INVALIDATE tbl n) [(ENTRY_INVALIDATE (EL (SUC n) tbl))]))

    ⊢def (DEL_TAB_ENTRY rep tbl sg 0 =
           ((KEY_MATCH rep tbl sg 0) → [(ENTRY_INVALIDATE (EL 0 tbl))] |
                                        [(EL 0 tbl)])) ∧
         (DEL_TAB_ENTRY rep tbl sg (SUC n) =
           (APPEND (DEL_TAB_ENTRY rep tbl sg n)
             ((KEY_MATCH rep tbl sg (SUC n)) →
                 [(ENTRY_INVALIDATE (EL (SUC n) tbl))]
               | [(EL (SUC n) tbl)])))

    ⊢def LRU_REPL rep tbl entry n =
           TAB_INSERT (DEL_TAB_ENTRY rep tbl (key entry) n) entry n

    ⊢def LRU_LOOKUP rep mem tbl n addr data newTbl =
           let who = (TAB_HIT rep tbl addr n n) in
           ((who = (SUC n))
              → (data = fetch rep (mem, addr) ∧
                 newTbl = TAB_INSERT tbl (T, addr, (fetch rep (mem, addr))) n)
              | (data = (content (EL who tbl)) ∧
                 newTbl = LRU_REPL rep tbl (EL who tbl) n))
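The table operations can be followed on concrete values with the sketch below. It transcribes the recursive definitions into Python under several assumptions: entries are (live, key, content) tuples, the table is a list with indices 0 to n, addrEq is ordinary equality, and memory is a dictionary. LRU_LOOKUP is a relation over data and newTbl in the specification; here it is written as a function that returns both.

```python
def live(entry): return entry[0]
def key(entry): return entry[1]
def content(entry): return entry[2]

def tab_full(tbl, n):
    """Entries 0..n are all live."""
    return all(live(tbl[i]) for i in range(n + 1))

def tab_insert(tbl, entry, n):
    """Push entry on top; the first dead entry (or the bottom one) is dropped."""
    out = [entry]
    for k in range(n):
        out.append(tbl[k] if tab_full(tbl, k) else tbl[k + 1])
    return out

def tab_hit(tbl, addr, n):
    """Index of a live entry with a matching key, or n + 1 on a miss."""
    for i in range(n, -1, -1):
        if live(tbl[i]) and key(tbl[i]) == addr:
            return i
    return n + 1

def del_tab_entry(tbl, addr, n):
    """Invalidate every live entry whose key matches addr."""
    return [(False, key(e), content(e)) if live(e) and key(e) == addr else e
            for e in tbl[:n + 1]]

def lru_repl(tbl, entry, n):
    return tab_insert(del_tab_entry(tbl, key(entry), n), entry, n)

def lru_lookup(mem, tbl, n, addr):
    """Return (data, new table) for a read of addr."""
    who = tab_hit(tbl, addr, n)
    if who == n + 1:                       # miss: fetch and insert on top
        data = mem[addr]
        return data, tab_insert(tbl, (True, addr, data), n)
    return content(tbl[who]), lru_repl(tbl, tbl[who], n)

def keys(tbl):
    """Hypothetical helper: keys of the live entries, top first."""
    return [key(e) for e in tbl if live(e)]
```

Starting from three dead entries, reads of addresses 10, 20, 30 and then 10 again leave the live keys ordered 10, 30, 20. A further read of a new address evicts 20, the least recently used entry, which is the behavior the text describes.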


Using the above definitions, the cache-memory subsystem can be defined. This
definition replaces memoryUnit_spec in the MMU specification and the new system is verified
in a similar manner. The proof shows that the cache/memory system is consistent with the
MMU memory model requirements.


    ⊢def cache_mem_spec r req rwe addr data done mem tbl n =
           (done 0 = F) ∧
           (∀t. (req t) →
               (∃t'. Next done (t, t+t') ∧
                   (wBIT (rwe t) →
                       ((mem (t+t') = store r (mem t, addr t, data t)) ∧
                        (tbl (t+t') = DEL_TAB_ENTRY r (tbl t) (addr t) n)) |
                       (LRU_LOOKUP r (mem t) (tbl t) n (addr t)
                                   (data (t+t')) (tbl (t+t')) ∧
                        (mem (t+t') = mem t))))
             | ((done (t+1) = F) ∧
                (mem (t+1) = mem t) ∧
                (tbl (t+1) = tbl t)))


Cache  Implementation 

The cache implementation consists of a control unit and a stack of cache cells. Cache
cells are the instantiation of the table entries described above; their state consists of
the three-tuple (valid, address key, data). The action of each cache cell is defined by a
two bit function code (req) sent by the cache control unit. The stack is formed by joining
the outputs of a cache unit to the inputs of the next.

Figure 2: Cache Cell Stack


    ⊢def (cache_block rep state req sparceIn foundIn addr replState dataIn 0 =
            cache_cell rep 0 req addr replState (state, sparceIn, foundIn, dataIn))
         ∧
         (cache_block rep state req sparceIn foundIn addr replState dataIn (SUC n) =
            (cache_cell rep (SUC n) req addr (EL n state)
               (cache_block rep state req sparceIn foundIn addr replState dataIn n)))

    ⊢def cache_cell_spec rep n req addr replState (stateIn, sparceIn, foundIn, dataIn) =
           let state = (EL n stateIn) in
           let match = (addrEq rep (addr, key state) ∧ live state) in
           (req = (F,F)) →                               % IDLE %
              (stateIn, foundIn, (sparceIn ∨ ¬live state), dataIn) |
           (req = (F,T)) →                               % INVALIDATE ON MATCH %
              (match →
                 (SET_EL n stateIn (F, key state, content state), T, T, content state) |
                 (stateIn, foundIn, (sparceIn ∨ ¬live state), dataIn)) |
           (req = (T,F)) →                               % INVALIDATE %
              (SET_EL n stateIn (F, key state, content state), foundIn, T, dataIn) |
           % req = (T,T) → PUSHDOWN %
              (sparceIn →
                 (stateIn, foundIn, T, content state) |
                 (SET_EL n stateIn replState, foundIn, F, dataIn))


When  a memory  request  is  made,  the  control  unit  signals  each  cache  cell  to  invalidate 
its  entry  if  its  key  matches  the  input  address  (F,T).  Memory  write  requests  are  also  passed 
through  to  memory.  If  a read  request  is  pending  and  the  value  is  not  in  the  cache,  the 
value  is  fetched  from  memory.  We  assume  one  clock  cycle  is  needed  to  read  a value  out  of 
the  cache  if  it  is  available.  After  the  value  fetch  step  is  completed,  the  control  unit  pushes 
the  new  value  onto  the  cache  cell  stack  by  issuing  request  (T,T). 

To model memory, the cache implementation uses the same memory unit specification
(memoryUnit_spec) stated previously. We then verify that the implementation behaves
as specified. The implementation also provides a means of invalidating the entire table
(request (T,F)); however, this function is not currently present in the specification.


4 Summary 

We  have  described  the  formal  verification  of  an  MMU  and  cache/memory  subsystem. 
The  MMU  has  been  verified  to  perform  correctly  with  an  asynchronous  memory  model. 
The  cache  specification  defines  an  LRU  replacement  policy  which  is  implemented  by  an 
electronic block level design. The cache is also demonstrated to be consistent with MMU
memory model requirements.

It has been convenient to represent the behavior of devices using abstract
representations. This mechanism allows the verification effort to focus on the correctness of higher
level abstraction. To verify a more concrete implementation, the abstract representation
can  be  instantiated  with  components  that  implement  concrete  behavior.  Extending  this 
example,  we  plan  to  demonstrate  how  a complete  system  composed  of  many  devices  can 
be  shown  to  correctly  implement  an  abstract  system  specification. 

Abstract - The use of formal methods to verify the correctness of digital circuits is
less constrained by the growing complexity of digital circuits than conventional
methods based on exhaustive simulation. This paper briefly outlines three
main approaches to formal hardware verification: symbolic simulation, state
machine analysis, and theorem-proving.

1 Introduction 

Advances in VLSI fabrication technology have greatly outstripped ‘verification capacity’,
that is, the capacity of conventional methods for demonstrating that the design of a
circuit is correct with respect to a specification of its requirements.

Verification capacity has fallen behind fabrication technology because conventional
verification methods do not scale with complexity. These methods are generally based on
simulation; they do not scale because the number of simulation cases is likely to increase
exponentially if one attempts to maintain the same degree of coverage.

Considerable  effort  has  been  made  to  increase,  in  a brute-force  manner,  what  coverage 
can  be  achieved  with  simulation.  One  approach  is  to  distribute  the  simulation  cases  over 
a large  number  of  machines  running  identical  versions  of  the  simulation  model.  Another 
brute-force  approach  has  been  the  development  of  special-purpose  simulation  hardware 
to  increase  the  speed  of  a simulation  by  several  orders  of  magnitude.  However,  these 
techniques  do  not  offer  a satisfactory,  long-term  solution  for  verifying  digital  designs  by 
exhaustive simulation because, in general, the number of simulation cases grows
exponentially with the number of components in a design.

Of  course,  it  may  be  argued  that  it  is  not  really  necessary,  for  any  practical  purpose, 
to  exhaustively  simulate  a design  in  order  to  detect  every  error  in  a design.  Instead,  it 
would  be  argued  that  it  is  only  reasonable  to  simulate  the  design  for  a feasible  number 
of representative cases. However, this assumes that there is a general-purpose, systematic
method for finding a truly representative set of simulation cases. Although one can easily
imagine  a systematic  way  of  generating  some  obvious  cases,  it  is  clear  that  digital  systems 
often  fail  at  the  “confluence  of  unrelated  or  seemingly  unrelated  events”  [25]. 

Formal methods offer considerable hope for verification techniques which are better able
to scale with the complexity of VLSI designs. We can identify three distinct approaches
to formal hardware verification, namely, symbolic simulation, state machine analysis, and
theorem-proving. These formal approaches to hardware verification are better able to scale
with the complexity of VLSI designs because they exploit powerful tools of mathematics
rather than brute force. A good example is the use of mathematical induction, which is
a mainstay of the theorem-proving approach to formal hardware verification. All of these
approaches are supported by software tools, many of which have been under constant
development for the last decade or longer.


2 The  Symbolic  Simulation  Approach 

The  concept  of  symbolic  simulation  was  first  proposed  by  researchers  in  the  late  1970’s  as  a 
method  for  evaluating  register  transfer  language  representations  [11].  The  early  programs 
were  very  limited  in  their  analytical  power  since  their  symbolic  manipulation  methods  were 
weak.  Consequently,  symbolic  simulation  did  not  evolve  much  further  until  more  efficient 
methods of manipulating symbols emerged. The development of Ordered Binary Decision
Diagrams (OBDDs) for representing Boolean functions [8] radically transformed symbolic
simulation.

The  first  “post-OBDD”  symbolic  simulators  were  simple  extensions  of  traditional  logic 
simulators  [7].  In  these  symbolic  simulators  the  input  values  could  be  arbitrary  Boolean 
expressions over some Boolean variables rather than only 0’s, 1’s (and possibly X’s) as in
traditional  logic  simulators.  Consequently,  the  results  of  the  simulation  were  not  single 
values  but  rather  Boolean  functions  describing  the  behavior  of  the  circuit  for  the  set  of 
all  possible  data  represented  by  the  Boolean  variables.  To  illustrate  this  idea,  consider 
the  (pseudo)  Domino  CMOS  circuit  shown  in  Fig.  1.  If  the  circuit  is  clocked  correctly, 
the  inputs  are  stable  long  enough  before  the  clock  goes  high,  and  the  inputs  and  clock 
signal  are  then  kept  stable,  the  output  node  should  eventually  change  to  1 if  and  only 
if  the  number  represented  by  the  4-bit  binary  input  vector  a is  greater  than  the  number 
represented  by  the  4-bit  binary  input  vector  6,  and  both  numbers  are  greater  than  Zero. 
In  a simple  OBDD-based  symbolic  simulator  we  would  simply  apply  the  Boolean  input 
variables  at  the  correct  time  and  in  the  end  compare  the  value  on  the  output  node  with 
the  Boolean  function! 


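$$(a_3\overline{b_3} + (\overline{a_3 \oplus b_3})(a_2\overline{b_2} + (\overline{a_2 \oplus b_2})(a_1\overline{b_1} + (\overline{a_1 \oplus b_1})\,a_0\overline{b_0})))\,(b_3 + b_2 + b_1 + b_0)$$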
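The comparator function described in the text (output 1 if and only if a > b and both numbers are greater than zero) can be cross-checked against ordinary integer comparison. The following is only a minimal sketch: the names `comparator`, `to_int` and `check_all` are illustrative, and the check enumerates all 256 input combinations explicitly, whereas a symbolic simulator would obtain the same function in a single run over Boolean variables.

```python
from itertools import product

def comparator(a3, a2, a1, a0, b3, b2, b1, b0):
    """Nested 'greater than' function, gated by 'b is non-zero'."""
    eq = lambda x, y: not (x ^ y)   # bitwise equivalence (XNOR)
    gt = (a3 and not b3) or (eq(a3, b3) and (
         (a2 and not b2) or (eq(a2, b2) and (
         (a1 and not b1) or (eq(a1, b1) and a0 and not b0)))))
    return bool(gt and (b3 or b2 or b1 or b0))

def to_int(bits):
    """Interpret a tuple of bits, most significant bit first, as an integer."""
    value = 0
    for bit in bits:
        value = 2 * value + int(bit)
    return value

def check_all():
    """Return the first disagreeing input vector, or None if there is none."""
    for bits in product((False, True), repeat=8):
        a, b = bits[:4], bits[4:]
        expected = to_int(a) > to_int(b) > 0   # a > b and b > 0 (hence a > 0)
        if comparator(*bits) != expected:
            return bits
    return None

print(check_all())
```

Note that a > b together with b > 0 already implies a > 0, which is why the final factor only needs to test the bits of b.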

A verifier based on symbolic simulation applies logic simulation to compute the circuit’s
response to a series of stimuli chosen to detect all possible design errors. When a circuit has
been verified by simulation, this means that any further simulation would not uncover
any errors. Hence, the problem of verifying the correctness of a design becomes one of
simulating a large number of input patterns. Selecting such a set of simulation patterns is a
nontrivial task, since errors that arise during the design process cannot be easily
characterized. Designers’ misconceptions, incomplete or inconsistent specifications, and carelessness
on the part of the designer can cause the resulting circuit to behave unpredictably. Worst
of all, it may misbehave only under unusual combinations of circumstances. Rather than
trying to postulate a simplified “fault model” for design errors, it is more appropriate to
adopt a philosophy that design verification must work against a malicious adversary. That
is, given a proposed simulation test, the adversary will attempt to create a circuit that
does not fulfill the specification, yet passes the test. A circuit is considered “correct” only
if no adversary can defeat the simulation test. Thus, when a circuit has been “verified” by
simulation, this means that any further simulation would not uncover any errors.

Since  a symbolic  simulator  is  based  on  a traditional  logic  simulator,  it  can  use  the  same, 
quite  accurate,  electrical  and  timing  models  to  compute  the  circuit  behavior.  For  example, 
a detailed  switch-level  model,  capturing  charge  sharing  and  subtle  strengths  phenomena, 
and  a timing  model,  capturing  bounded  delay  assumptions,  are  well  within  reach.  Also — 
and  of  great  significance — the  switch-level  circuit  used  in  the  simulator  can  be  extracted 
automatically  from  the  physical  layout  of  the  circuit.  Hence,  the  correctness  results  will 
link  the  physical  layout  with  some  higher  level  of  specification. 

Recently, Bryant and Seger [10] developed a new generation of verifier based on symbolic
simulation. Here the simulator establishes the validity of formulas expressed in a very
limited, but precisely defined, temporal logic. This temporal logic allows the user to express
properties of the circuit over trajectories: bounded-length sequences of circuit states. The
verifier checks the validity of these formulas by a modified form of symbolic simulation.
Further, by exploiting the 3-valued modeling capability of the simulator, where the third
logic value X indicates an unknown or indeterminate value, the complexity of the symbolic
manipulations is reduced considerably.

This  verifier  supports  a verification  methodology  in  which  the  desired  behavior  of  the 
circuit  is  specified  in  terms  of  a set  of  assertions,  each  describing  how  a circuit  operation 
modifies  some  component  of  the  (finite)  state  or  output.  The  temporal  logic  allows  the  user 
to  define  such  interface  details  as  the  clocking  methodology  and  the  timing  of  input  and 
output signals. The combination of timing and state transition information is expressed
by an assertion over state trajectories, giving properties the circuit state and output should
obey at certain times whenever the state and inputs obey some constraints at earlier times.

This form of specification works very well for circuits that are normally viewed as
state transformation systems, i.e., where each operation is viewed as updating the circuit
state. Using a prototype system, a simple 32-bit microprocessor and a significant portion
of a modern 32-bit RISC microprocessor have been verified. These circuits contained
around 15,000 transistors and the verification effort required less than two hours on a
MIPS Magnum 3000 workstation. The complete verification process, including developing
the specification, deriving the circuit description, and carrying out the symbolic ternary
simulation, took less than a person-week.
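The role of the third logic value X in ternary simulation can be illustrated with a few gate models. This is a minimal sketch under simple assumptions, not the verifier's actual algorithm; the function names (`t_not`, `t_and`, `t_or`, `mux`) are hypothetical. It shows that one simulation run with an input left at X covers both of its Boolean values whenever the result is a definite 0 or 1.

```python
X = "X"   # unknown or indeterminate value

def t_not(a):
    return X if a == X else 1 - a

def t_and(a, b):
    if a == 0 or b == 0:
        return 0            # a controlling 0 decides the output even if the other input is X
    if a == X or b == X:
        return X
    return 1

def t_or(a, b):
    if a == 1 or b == 1:
        return 1            # a controlling 1 decides the output even if the other input is X
    if a == X or b == X:
        return X
    return 0

def mux(sel, d1, d0):
    """Two-input multiplexer built from the ternary gates above."""
    return t_or(t_and(sel, d1), t_and(t_not(sel), d0))

print(mux(1, 0, X))   # the unselected input is irrelevant
print(mux(X, 1, 1))   # gate-level ternary evaluation is pessimistic here
```

The second call shows the price of the reduced complexity: the gate-level evaluation yields X although both Boolean values of `sel` would give 1, so an X result means "not established", not "wrong".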

3 State  Machine  Analysis 

A second approach to formal hardware verification is state machine analysis. This approach
uses algorithmic techniques to decide whether a finite state machine satisfies a set of
user-specified properties. In this brief overview, we focus on just one particular approach to
state machine analysis called model-checking. Other approaches to state machine analysis
include those based on language containment tests [23].

We  use  the  example  of  a simple  handshaking  protocol,  illustrated  by  the  timing  di- 
agram  in  Figure  2,  to  describe  the  state-machine  analysis  approach  to  formal  hardware 
verification. 


Figure  2:  Timing  diagram  for  simple  handshaking  protocol. 

This protocol could be implemented either in software or directly in hardware. A
‘correct’ implementation of this protocol must satisfy the following properties:

“whenever the request signal becomes true, it must remain true
until it is acknowledged”

“every request must eventually be acknowledged”

“whenever the acknowledgement signal becomes true, it must remain true
until the request signal returns to false”

“the request signal will eventually return to false after
the request is acknowledged”

“whenever the request signal is false, it will remain false until
the acknowledgement signal is also false”

“the acknowledgement signal will eventually return to false after
the request signal returns to false”

“once false, the acknowledgement signal will remain false until
there is a request”

“whenever the acknowledgement signal is false,
there will eventually be a request”

These properties can be translated one-by-one into temporal logic. The symbols U, ◇,
¬ and → can be informally read as “until”, “eventually”, “not” and “implies”.

(req → (req U ack))

(req → (◇ack))

(ack → (ack U (¬req)))

(ack → (◇(¬req)))

((¬req) → ((¬req) U (¬ack)))

((¬req) → (◇(¬ack)))

((¬ack) → ((¬ack) U req))

((¬ack) → (◇req))

A program  for  automatic  state  machine  analysis  would  take,  as  input,  a machine- 
readable  list  of  formally  specified  properties  such  as  the  eight  properties  listed  above.  The 
analyzer  program  would  also  take,  as  input,  a machine-readable  description  of  a finite 
state  machine,  for  example,  a model  of  a candidate  implementation  of  the  handshaking 
protocol.  The  analyzer  program  would  then  generate  either  the  answer  “yes”,  meaning 
that  the  state  machine  does  indeed  satisfy  all  of  the  properties  supplied  by  the  user,  or 
the  answer  “no”,  meaning  that  the  machine  fails  to  satisfy  at  least  one  of  these  properties. 
When  the  outcome  is  “no”,  the  analyzer  may  also  produce  helpful  information  about  how 
the  state  machine  fails  to  satisfy  a particular  property,  i.e.,  a counter-example. 
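The yes/no behaviour of such an analyzer can be sketched for the handshaking example. The code below is a deliberately simplified illustration, not a real model checker: it assumes the candidate implementation is a single deterministic cycle of (req, ack) states that repeats forever, and it checks each of the eight properties at every position of that cycle. All names (`until`, `eventually`, `analyze`, `GOOD`, `BAD`) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[bool, bool]            # (req, ack)
Pred = Callable[[State], bool]

def until(cycle: List[State], i: int, p: Pred, q: Pred) -> bool:
    """p U q at position i of a cycle that repeats forever."""
    n = len(cycle)
    for k in range(n):               # one full lap visits every future state
        s = cycle[(i + k) % n]
        if q(s):
            return True
        if not p(s):
            return False
    return False                     # q never holds anywhere on the cycle

def eventually(cycle: List[State], i: int, q: Pred) -> bool:
    return until(cycle, i, lambda s: True, q)

req: Pred = lambda s: s[0]
ack: Pred = lambda s: s[1]
nreq: Pred = lambda s: not s[0]
nack: Pred = lambda s: not s[1]

# (name, antecedent, consequent) for the eight properties listed in the text
PROPERTIES = [
    ("req -> (req U ack)",     req,  lambda c, i: until(c, i, req, ack)),
    ("req -> <>ack",           req,  lambda c, i: eventually(c, i, ack)),
    ("ack -> (ack U ~req)",    ack,  lambda c, i: until(c, i, ack, nreq)),
    ("ack -> <>~req",          ack,  lambda c, i: eventually(c, i, nreq)),
    ("~req -> (~req U ~ack)",  nreq, lambda c, i: until(c, i, nreq, nack)),
    ("~req -> <>~ack",         nreq, lambda c, i: eventually(c, i, nack)),
    ("~ack -> (~ack U req)",   nack, lambda c, i: until(c, i, nack, req)),
    ("~ack -> <>req",          nack, lambda c, i: eventually(c, i, req)),
]

def analyze(cycle: List[State]) -> Tuple[str, Optional[str], Optional[int]]:
    """Return ("yes", None, None) or ("no", failed property, position)."""
    for name, antecedent, consequent in PROPERTIES:
        for i, s in enumerate(cycle):
            if antecedent(s) and not consequent(cycle, i):
                return ("no", name, i)      # position i serves as a counter-example
    return ("yes", None, None)

GOOD = [(False, False), (True, False), (True, True), (False, True)]
BAD = [(False, False), (True, False)]       # request withdrawn before any acknowledgement

print(analyze(GOOD))
print(analyze(BAD))
```

A practical analyzer works on a full state machine with branching behaviour rather than a single cycle; the sketch only shows how a property, a model and a counter-example fit together.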

State-machine analysis is bounded by the number of states in the finite state machine.
Early state machine analysis techniques relied on the explicit enumeration of states which,
reportedly, limits the use of these techniques to systems with between 10^3 and 10^6
reachable states. Unfortunately, the number of states in a system may grow exponentially with
the number of concurrent components in the system. To deal with this “state explosion
problem”, several groups [3,9,16] have investigated ways to represent a state space
symbolically rather than explicitly. A popular candidate for the symbolic representation of states
is the OBDD, mentioned earlier in connection with the symbolic simulation approach.
Using this symbolic approach, it is reported that state machine analysis can be applied in
practice to systems with in excess of 10^20 states [9].
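The exponential growth behind the state explosion problem is easy to reproduce with explicit enumeration. The sketch below assumes a toy system of n independent two-state components, each of which may toggle at any step; the function name `reachable_count` is hypothetical. A breadth-first search visits every reachable state one by one, which is exactly the cost that symbolic representations try to avoid.

```python
from collections import deque

def reachable_count(n):
    """Count reachable states of n independent toggling bits by explicit BFS."""
    start = (0,) * n
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for i in range(n):
            # one component changes state; all others keep theirs
            successor = state[:i] + (1 - state[i],) + state[i + 1:]
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return len(seen)

for n in (1, 2, 3, 10):
    print(n, reachable_count(n))
```

With this model the count doubles for every added component, so about 66 such components would already exceed the 10^20 states quoted above.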

State  machine  analysis  techniques  have  been  applied  to  several  commercial  designs. 
These  techniques  were  used  to  discover  several  possible  execution  sequences  leading  to 
failure  in  a design  for  the  cache  consistency  protocol  of  the  Encore  Gigamax  multiprocessor 
[24]. Another approach to state machine analysis (based on language containment) has
been used by AT&T in the design of a packet layer controller chip [23].

4 Theorem  Proving 

A third approach to formal hardware verification, computer-assisted theorem-proving, is
based on the construction of a proof in formal logic. This proof is a formal argument that
a hardware design, based on some model of the primitive components, satisfies a formal
specification of its requirements. Figure 3 shows an example of a formal proof establishing
the correctness of the two-component design shown in Figure 4.


 1. ANDGate_IMP (i1,i2,outp)                    [from above circuit diagram]
 2. ∃x. NANDGate (i1,i2,x) ∧ NOTGate (x,outp)   [by def. of ANDGate_IMP]
 3. NANDGate (i1,i2,x) ∧ NOTGate (x,outp)       [strip off “∃x.”]
 4. NANDGate (i1,i2,x)                          [left conjunct of line 3]
 5. x = ¬(i1 ∧ i2)                              [by def. of NANDGate]
 6. NOTGate (x,outp)                            [right conjunct of line 3]
 7. outp = ¬x                                   [by def. of NOTGate]
 8. outp = ¬(¬(i1 ∧ i2))                        [substitution, line 5 into 7]
 9. outp = (i1 ∧ i2)                            [simplify, ¬¬t = t]
10. ANDGate (i1,i2,outp)                        [by def. of ANDGate]
11. ANDGate_IMP (i1,i2,outp)
      ⇒ ANDGate (i1,i2,outp)                    [discharge assumption, line 1]

Figure 3: Formal proof of correctness for an AND-gate.

Figure 4: Implementation of an AND-gate from a NAND-gate and an inverter.


Both ordinary human reasoning and formal proof can be used to show that a specific
conclusion follows from a certain set of assumptions by accepted laws of reasoning.
However, formal proof is a purely syntactic process. A proof is formally defined as a sequence
of lines (such as the numbered sequence of lines in Figure 3) where each line follows from
a previous line by a rule of inference. There are only a finite number of primitive inference
rules (in fact, usually a very small number of primitive rules). The validity of any
particular line in a proof can be decided by a purely syntactic test based on checking to see if any
one of the primitive inference rules can be used to justify that particular line in the proof.

Unlike  ordinary  human  reasoning,  which  is  notoriously  error-prone,  formal  proof  is 
extremely  rigorous.  Indeed,  its  main  advantage  is  that  it  can  be  mechanically  checked.  The 
main  disadvantage  of  formal  proof,  compared  to  ordinary  human  reasoning,  is  that  formal 
proof  is  overwhelmingly  tedious.  The  very  simple  proof  in  Figure  3 has  just  eleven  lines, 
but  a formal  proof  of  correctness  for  a real  design  (such  as  a simple  microprocessor)  may 
involve several million primitive inference steps. Fortunately, there has been considerable
progress made towards the partial automation of formal proof. A very large fraction of
the actual line-by-line inference steps in a formal proof can be generated automatically by
a computer-based theorem-prover.

A digital circuit can be “verified” using a theorem-prover by generating a theorem which
states that the formal specification of a design logically satisfies a formal specification of
its intended behaviour (i.e. a high level model). The exact meaning of “satisfies” is
stated unambiguously as a mathematical relationship between the two levels of formal
specification. In the very simple example shown in Figure 3, logical implication, ⇒,
is used to express the relationship between the implementation of the AND-gate and its
behavioural specification:

ANDGate_IMP (i1,i2,outp) ⇒ ANDGate (i1,i2,outp)
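For a design this small, the implication can also be sanity-checked by evaluating both sides over every Boolean assignment. The sketch below is a truth-table check, not a formal proof in the sense of Figure 3, and the Python predicate names simply mirror the relations used in the proof; the internal wire x is existentially quantified by trying both of its values.

```python
from itertools import product

def nand_gate(i1, i2, x):
    return x == (not (i1 and i2))

def not_gate(x, outp):
    return outp == (not x)

def and_gate(i1, i2, outp):
    """Behavioural specification of the AND-gate."""
    return outp == (i1 and i2)

def and_gate_imp(i1, i2, outp):
    """Implementation: there exists an internal wire x joining the two components."""
    return any(nand_gate(i1, i2, x) and not_gate(x, outp) for x in (False, True))

def theorem_holds():
    """Check ANDGate_IMP(i1,i2,outp) => ANDGate(i1,i2,outp) for all eight assignments."""
    return all((not and_gate_imp(i1, i2, outp)) or and_gate(i1, i2, outp)
               for i1, i2, outp in product((False, True), repeat=3))

print(theorem_holds())
```

Exhaustive evaluation of this kind stops being feasible long before a real design is reached, which is where the inference-based proof becomes necessary.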

The theorem-proving approach to formal hardware verification is a structural approach
in contrast to both symbolic simulation and state-machine analysis which are behavioural
approaches. The latter two approaches, symbolic simulation and state-machine analysis,
both apply verification techniques to a ‘flat’ design; they do not require additional
details about the hierarchical structure of the design. On the other hand, theorem-proving
can only be applied ‘in the large’ to a hierarchically structured design. In a
theorem-proving approach, the design is verified hierarchically. The proof hierarchy is generally a
reflection of the hierarchical structure of the design. For example, the bottom level of this
hierarchical process may involve the formal verification of simple RTL (Register Transfer
Level) components composed from primitives such as CMOS transistors. Each kind of
component only has to be verified once, in contrast to other approaches which verify
every instance of that particular component. At higher levels in the verification hierarchy,
each instance of a particular component uses the single verification result obtained from a
lower level.

The direct re-use of a verification result for multiple identical instances of a particular
component is a very simple form of how a single verification result can be re-used.
Theorem-proving approaches also allow generic specifications to be formally verified where
a specification is parameterized by scalar values (e.g., the number of bits in an RTL
component) or even by data types and operations [21]. A single generic verification result can
be instantiated for different parameter values; for example, the generic specification of an
n-bit multiplier can be instantiated for different values of n, e.g., a 16-bit multiplier or a
32-bit multiplier.
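The idea of a specification parameterized by the word length n can be illustrated with a simpler component than a multiplier. The sketch below swaps in an n-bit ripple-carry adder (an illustrative substitute, not an example taken from the cited work) and checks it against integer addition for a few small values of n; all function names are hypothetical. Unlike a generic proof, which establishes the result for every n at once (typically by induction on n), this check has to be repeated for each instance and grows exponentially with n.

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first)."""
    carry = 0
    total = []
    for a, b in zip(a_bits, b_bits):
        total.append(a ^ b ^ carry)                 # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))         # full-adder carry out
    total.append(carry)
    return total

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def check_instance(n):
    """Exhaustively compare the n-bit adder with integer addition."""
    return all(from_bits(ripple_carry_add(to_bits(a, n), to_bits(b, n))) == a + b
               for a in range(2 ** n) for b in range(2 ** n))

print([check_instance(n) for n in range(1, 5)])
```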

The theorem-proving approach relies heavily upon (and benefits greatly from) a number
of mathematical tools. This includes, as with symbolic simulation, the ability to represent
data symbolically. Mathematical induction is also critical for scaling with the increasing
complexity of circuit designs.

A distinct advantage of the theorem-proving approach to formal hardware verification is
the ability to verify digital systems with respect to higher algebraic levels. For example, the
correctness of arithmetic hardware can be stated directly in terms of arithmetic operations
on natural numbers rather than Boolean operations on bit-vectors. This is often referred
to as data abstraction; an illustrative example of this technique is given by Chin [12] in
verifying arithmetic hardware for signal processing applications. Other kinds of abstraction
include temporal abstraction, which is a technique for relating computational behaviour at
increasingly abstract time scales.

Among the best known interactive theorem-provers are the Boyer-Moore Theorem
Prover [4] and the Cambridge HOL System [18,19]. The Boyer-Moore Theorem Prover
has been used by researchers at Computational Logic Inc. to develop a multi-level proof
of correctness for a complete computer system including both hardware and software levels
[1,2]. The Cambridge HOL System has been used by researchers at Cambridge
University to verify aspects of the commercially-available Viper microprocessor designed by the
British Ministry of Defence for safety-critical applications [6,13,14,15].

5 Summary  and  Future  Directions 

This  paper  has  briefly  described  three  main  approaches  to  formal  hardware  verification: 
symbolic  simulation,  state-machine  analysis  and  theorem-proving.  There  have  already 
been  some  trial  applications  of  these  verification  techniques  to  real  commercial  designs  (as 
mentioned  earlier)  and  there  is  evidence  of  increasing  industrial  interest  in  these  techniques. 

Many research efforts in this area are now focussed on the issue of integrating formal
verification techniques with conventional CAD. For example, researchers at Cambridge
University are investigating the use of conventional HDL’s (Hardware Description
Languages) such as Ella and VHDL as specification languages for theorem-proving techniques
[5]. Another example is work that investigates links between formally verified hardware
specifications and conventional CAD tools such as silicon compilers [22].

It is unlikely that any one of the three verification approaches described in this paper
offers, by itself, a “complete” approach to verifying digital hardware. However, we believe
that a “complete” approach may be achieved by some combination of the three approaches
described here. We are currently developing a hybrid approach that is based on a
combination of symbolic simulation (using the COSMOS system) and theorem-proving (using
the Cambridge HOL system). The objective of our research is a hybrid formal verification
methodology (and supporting tools) which combines the complementary advantages of
theorem-proving and symbolic simulation. This methodology would allow a very abstract
specification of a digital system (specified with the full expressive power of higher-order
logic) to be verified with respect to a switch-level model of a CMOS digital circuit. Initial
progress on the development of a “mathematical interface” for this hybrid approach is
reported in [26].

High Accuracy Switched-Current Circuits Using an
Improved Dynamic Mirror

G.  Zweigle  and  T.  Fiez 

School  of  Electrical  Engineering  and  Computer  Science 
Washington  State  University 
Pullman,  WA  99164 

Abstract¹ - The switched-current technique, a recently developed circuit
approach to analog signal processing, has emerged as an alternative/complement
to the well established switched-capacitor circuit technique. High speed
switched-current circuits offer potential cost and power savings over slower
switched-capacitor circuits. Accuracy improvements are a primary concern at this stage
in the development of the switched-current technique. Use of the dynamic
current mirror has produced circuits that are insensitive to transistor matching
errors [1]. The dynamic current mirror has been limited by other sources of
error including clock-feedthrough and voltage transient errors. In this paper
we present an improved switched-current building block using the dynamic
current mirror. Utilizing current feedback, the errors due to current
imbalance in the dynamic current mirror are reduced. Simulations indicate that
this feedback can reduce total harmonic distortion by as much as 9 dB.
Additionally, we have developed a clock-feedthrough reduction scheme for which
simulations reveal a potential 10 dB total harmonic distortion improvement.
The clock-feedthrough reduction scheme also significantly reduces offset
errors and allows for cancellation with a constant current source. Experimental
results confirm the simulated improvements.

1 Introduction 

The  switched-current  (SI)  sampled-data  signal  processing  technique  is  becoming  a viable 
alternative  to  the  switched-capacitor  (SC)  technique.  Unlike  SC  circuits,  which  require 
additional  processing  steps  to  fabricate  precision  linear  capacitors,  SI  circuits  can  be  inte- 
grated in  a standard  digital  CMOS  process.  In  addition,  SI  circuits  can  operate  with  low 
power  supply  voltages,  they  can  operate  at  high  speeds,  and  they  are  very  area  efficient. 
The  drawback  of  SI  circuits  at  this  time  is  their  limited  accuracy.  This  problem  must 
be  overcome  in  order  for  switched-current  circuits  to  gain  the  wide  acceptance  switched- 
capacitor  circuits  have  attained.  In  this  paper,  an  SI  circuit  is  presented  that  significantly 
improves  the  accuracy  of  the  current-mode  system. 

¹This research was supported in part by a grant from the National Science Foundation Center for the
Design of Analog/Digital Integrated Circuits (CDADIC) at Washington State University, University of
Washington, and Oregon State University.

Figure 1: Switched-current track-and-hold circuit.

2 Switched-Current  Circuit  Operation 

2.1  Current  Track-and-Hold  

The current track-and-hold (T/H) is a basic building block of switched-current circuits
(Fig. 1). Transistors M1 and M2 are biased in saturation by the current sources, indicated
as I, and the track-and-hold operation is controlled by switch transistor M3. When the
clock is high, the input current is mirrored to the output. The parasitic gate capacitance
of transistor M2 stores a voltage corresponding to the value of the input current. When
the clock is low and transistor M3 is turned off, the drain current of M2 is held at a value
corresponding to the voltage stored on the gate of M2.

The current track-and-hold performs the four signal processing operations of inversion,
summation, scaling, and delay. Consider initially that the clock is high and the gates of
M1 and M2 are shorted. When a signal $i_1$ is input to the diode-connected transistor M1,
it is mirrored to transistor M2. The drain current of M2 is $I + i_1$. The output current is
$-i_1$, so this stage inverts the current. Summation of two input currents is obtained by
simply connecting wires. Scaled current output is obtained by scaling the aspect ratio of
M2 to M1. Finally, signal delay is controlled by switching transistor M3 on and off.

By using these basic signal processing operations, current track-and-hold circuits can
be combined to perform more complicated operations. One of these, the integrator, is
realized by cascading two current track-and-holds with feedback as shown in Figure 2.



The transfer function of this circuit is

$$i_o(z) = \frac{K\,(i_1 z^{-1} - i_2 z^{-0.5})}{1 - z^{-1}} \qquad (1)$$

The  non-inverting  integrator  input  is  at  the  input  of  the  first  T/H  and  the  inverting 
integrator  input  is  at  the  input  of  the  second  T/H.  The  switched-current  integrator  has 
been  shown  to  be  directly  analogous  to  the  switched-capacitor  integrator  [2].  Note  that, 
as  with  the  SC  integrator,  the  two  switches  of  the  SI  integrator  are  controlled  by  two 
phase  non-overlapping  clocks.  Additionally,  the  integrator  coefficient  K is  determined  by 
the  aspect  ratio  of  transistor  M5  to  transistor  M3.  In  the  SC  integrator  a capacitor  ratio 
determines  this  factor.  The  reliance  of  this  switched-current  circuit  on  transistor  matching 
has  contributed  to  its  limited  accuracy. 
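The behaviour of the non-inverting input can be illustrated with a short time-domain sketch. It assumes that this input sees a delaying integrator of the form K z^-1 / (1 - z^-1), i.e. y[n] = y[n-1] + K x[n-1], with the inverting input held at zero; the function name `si_integrator` and the sample values are hypothetical, and no circuit non-idealities are modelled.

```python
def si_integrator(samples, k, initial=0.0):
    """Ideal delaying integrator: y[n] = y[n-1] + k * x[n-1]."""
    output = []
    y = initial
    previous_input = 0.0
    for x in samples:
        y = y + k * previous_input   # the input sampled one full period earlier is accumulated
        output.append(y)
        previous_input = x
    return output

# A constant input current of 1 unit with K = 0.5 yields a ramp after one period of delay.
print(si_integrator([1.0, 1.0, 1.0, 1.0], 0.5))
```

In the circuit, K is set by a transistor aspect ratio, so any mismatch between the two devices appears directly as an error in the slope of this ramp.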

2.2  Dynamic  Current  Mirror 

The dynamic current mirror eliminates matching errors present in simple current
track-and-holds by mirroring current in time rather than space (Figure 3). Operation of the
dynamic mirror is controlled by switches MC1, MC2, and MC3. These switches require a
two phase non-overlapping clock, similar to the current track-and-hold integrator. It has
been shown that an integrator composed of the dynamic mirror cell does
not require any more clocks than the SI integrator presented previously [3]. Transistor
M1 is biased by the DC current source I. Initially the switches MC1 and MC2 are closed.
The signal current is read into the diode-connected transistor M1, producing a voltage on
its gate. This voltage is proportional to the square root of the input current for saturated
operation. The current is then read out by opening switches MC1 and MC2 while switch
MC3 is closed. The stored voltage produces an output current that is an inverted replica
of the input current. The dynamic mirror differs from the simple track-and-hold by using
switches to time multiplex one transistor, resulting in a float-and-hold operation. Unlike
the simple current mirror track-and-hold, where the switch transistor passes nearly zero
current, the controlling switches MC2 and MC3 of the dynamic mirror must pass the signal
current. Also, since the same transistor is used to mirror the signal current both in and out,
the method of scaling currents by scaling transistor aspect ratios is not possible using the
dynamic mirror. To perform signal scaling, additional dynamic mirrors are multiplexed in
time [4].

Figure 3: Dynamic current mirror float-and-hold cell.
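Why mirroring in time removes the dependence on device matching can be seen from an idealized square-law model. The sketch below is an assumption-laden illustration: it uses the textbook saturation equation I_D = (beta/2)(V_GS - V_T)^2, ignores clock-feedthrough, output impedance and voltage spikes, and all names and component values are hypothetical.

```python
import math

def gate_voltage(i_drain, beta, v_t):
    """Gate-source voltage of a saturated square-law transistor carrying i_drain."""
    return v_t + math.sqrt(2.0 * i_drain / beta)

def drain_current(v_gs, beta, v_t):
    """Saturated square-law drain current for a given gate-source voltage."""
    return 0.5 * beta * (v_gs - v_t) ** 2

def simple_mirror(i_in, bias, beta_in, beta_out, v_t):
    """Two-transistor mirror: the copy is made by a second, possibly mismatched, device."""
    v_gs = gate_voltage(bias + i_in, beta_in, v_t)
    return bias - drain_current(v_gs, beta_out, v_t)

def dynamic_mirror(i_in, bias, beta, v_t):
    """One transistor used twice: sample the gate voltage, then read it back out."""
    v_gs = gate_voltage(bias + i_in, beta, v_t)      # read-in phase
    return bias - drain_current(v_gs, beta, v_t)     # read-out phase, same device

i_in, bias = 10e-6, 100e-6
print(simple_mirror(i_in, bias, 50e-6, 50.5e-6, 0.8))   # 1 % beta mismatch
print(dynamic_mirror(i_in, bias, 50.5e-6, 0.8))
```

In this model the simple mirror's output error is the mismatch fraction times the total drain current (bias plus signal), whereas the dynamic mirror returns the inverted input current for any value of beta.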

3 Dynamic  Current  Mirror  Error  Sources 

Although SI circuits have been shown to be a viable signal processing circuit technique,
the poor accuracy limits system performance. Sources of this inaccuracy for the dynamic
mirror are finite output impedance effects, clock-feedthrough effects, and voltage spikes.
While these effects can be reduced by using large current mirror devices, this solution is
not optimum because increasing device size increases area requirements and reduces speed
capabilities. The finite output impedance of a dynamic current mirror results in current
division between stages. This division of the signal current becomes an AC gain error with
magnitude

$$\Delta i = (Z_{in}/Z_{out})\,i. \qquad (2)$$
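The ratio form of this gain error follows from an ordinary current divider. As a sketch, assume a stage with output impedance $Z_{out}$ driving the next stage's input impedance $Z_{in}$ with signal current $i$ (the symbol $i_{next}$ for the delivered current is introduced here only for illustration):

```latex
\begin{align*}
i_{next} &= i \,\frac{Z_{out}}{Z_{out} + Z_{in}}, \\
\Delta i &= i - i_{next} = i \,\frac{Z_{in}}{Z_{out} + Z_{in}}
          \;\approx\; \frac{Z_{in}}{Z_{out}}\, i
          \qquad (Z_{out} \gg Z_{in}).
\end{align*}
```

The error therefore vanishes only in the limit of infinite output impedance or zero input impedance.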

Ideally the output impedance of the dynamic mirror would be infinite and the input
impedance zero. Because these conditions are not met in real implementations, errors are
introduced in the output current. The output impedance of a dynamic current mirror
can be increased with the use of a cascode circuit. Because of its extremely high output
impedance and special feedback properties, the regulated gate cascode [5] was employed
for reducing the AC gain errors.

Clock-feedthrough effects in switched-current circuits
are more severe than in switched-capacitor circuits [6]. This is because the parasitic gate
capacitances of SI circuits are smaller than the linear capacitors implemented in SC circuits.
By using larger capacitors to hold the signal, SC circuits reduce the effect of small
clock-feedthrough charges. In SI circuits clock-feedthrough results in offset errors and increases
total harmonic distortion. Several methods of reducing this injected charge in SI circuits
have been studied to date. These include capacitive feedback [7], the use of dummy
switches [8], current difference cells [9], and an adaptive clock [7,9]. In this paper a scheme is
presented for reducing clock-feedthrough in the dynamic mirror with an improved adaptive
clock.

Finally, voltage spike errors are introduced in the dynamic mirror by the operation
of the two phase nonoverlapping clocks. During the switching of transistors MC2 and
MC3 there will be a nonzero interval of time when all of the switches are in the OFF state.
During this time period the data holding transistor M1 and the current source will be
attempting to draw two different currents. This can be seen in Figure 4. Transistor M1
will have a gate-source voltage set by the input current which will give a drain current of
$I + i$. The current mirror will be delivering current I. As a result, the voltage at the drain
of transistor M1 will have to move to counter the current imbalance. For negative input
currents, the voltage will increase in an attempt to shut off the current source and make
transistor M1 draw more current. For positive input currents the voltage will decrease,
attempting to draw more current from the current source and less from transistor M1.

Figure 4: Current mismatch in the dynamic current mirror at the switching interval.

There are two paths that these voltage transitions can couple through to the data
holding node. One is through switch transistor MC1 as it turns off with switch transistor
MC2. The other is through the drain-gate capacitance of transistor M1. Although it may
seem that a cascode could be used to buffer the drain of transistor M1 from the spikes,
this circuit will only be useful when functioning as designed. For positive input currents
negative spikes will cause the cascode to leave its proper operating point. Subsequent
to this, the drain of M1 will no longer be protected and the spike will couple through
$C_{gd}$. The resulting error from the voltage spikes is signal dependent in both magnitude
and polarity, difficult to predict, and sometimes worse than switch charge injection effects.
The voltage spike error must be eliminated before clock-feedthrough cancellation schemes
can be effective.


4 Voltage  Spike  Error  Reduction 

Two solutions for reducing voltage spike errors have been developed. The first involves
modifying the clock phasing of the dynamic mirror. By turning switch transistor MC1 off
slightly before turning switch transistor MC2 off, the path of the voltage spike through
transistor MC1 is eliminated. The new clocking scheme presented here does not add
another clock phase to the circuit; only a delay is needed. Transients occurring when the
transistor goes from the hold mode to the output float mode are irrelevant because a new
current value will be read into the diode connected transistor at this time. The delay
only needs to be as long as the turn-off time of transistor MC1. It can be implemented
on chip with an even number of cascaded inverters.

The other solution to voltage spike
errors must prevent the voltage spike from coupling through the drain-gate capacitance
of transistor M1. This is accomplished by using current feedback around a regulated gate
cascode dynamic current mirror, Fig. 5. The regulated gate cascode is used to increase
the output impedance of the dynamic mirror and the current feedback is used to keep the
cascode in its proper operating region for positive input currents. The circuit operates
as follows. Transistor M3 senses variations in the drain-source voltage of transistor M1.
Since a constant current biases M3, these variations will be amplified by the loop gain of
transistors M3 and follower M2. Differences between the drain-source voltage of transistor
M1 and the gate voltage of transistor M3 required to supply constant current J will be
amplified, stabilizing the drain voltage of M1. When the current in the current source,
M6, is smaller than the current in the drain of transistor M2, the voltage on the drain of
transistor M2 decreases because of the current mismatch. This increases the gate voltage
of transistor M4 due to the voltage feedback of transistor M3. Transistor M4 subsequently
sources additional current through current mirror transistors M5 and M6 to cancel the
current imbalance. When the current imbalance is corrected, the voltage on the gate of
transistor M4 returns to its original DC value. The current feedback, in the meantime,
keeps the voltage on the drain of transistor M2 more stable, which keeps the cascode in its
proper operating region. With the cascode functioning throughout the switching interval,
the drain of transistor M1 was buffered from the transients and the voltage stored on its
gate remained unaffected.

Figure 5: Regulated gate cascode with current feedback.

In order to verify the improvement in circuit performance with the current feedback
scheme, simulations were performed for a dynamic current mirror biased with 100 µA using
a 5 kHz sinusoidal 50 µA input signal. The circuit was clocked at 100 kHz. By eliminating
errors due to voltage spiking, the harmonic distortion was improved by almost 9 dB over
the cascode without current feedback.


5 Clock-Feedthrough  Error  Reduction 

Clock-feedthrough has been extensively analyzed in the literature [8,10]. Analysis shows
that  clock-feedthrough  is  dependent  on  the  aspect  ratio  of  the  switch  transistor  with  respect 
to  the  data  holding  transistor,  the  clock  slope,  and  the  magnitude  of  the  input  signal. 
The  signal  dependence  of  clock-feedthrough  leads  to  difficulties  in  predicting  the  error, 
a necessary  condition  for  cancellation.  The  adaptive  clock  is  a technique  for  reducing 
clock-feedthrough  through  control  of  the  ON  conductance  of  the  switch.  This  control 
simultaneously  reduces  the  clock  swing  on  the  switch  and  causes  the  gate-source  voltage 
of  the  switch  to  remain  constant  for  varying  input  signals.  With  a constant  gate-source 
voltage,  the  charge  injected  by  the  switch  becomes  constant.  This  results  in  the  possibility 
of  canceling  the  error  current  with  a constant  current  source.  In  order  to  use  such  a 
system,  the  nonsaturated  region  of  operation  must  be  used  for  the  data  holding  transistor. 
For  saturated  operation  nonlinear  transformations  between  voltage  and  current  result  in 
harmonic  distortion  even  if  a constant  clock-feedthrough  voltage  can  be  generated.  For 
the nonsaturated region of operation the transformations are linear. A simplified equation
for the drain current of a transistor operating in the nonsaturated region is given in Eqn. 3.
Added to the gate-source voltage is the constant clock-feedthrough voltage, $V_{cf}$.


$$i_{ds} = \beta V_{ds}\left(V_{gs} + V_{cf} - V_T - \tfrac{1}{2}V_{ds}\right) \tag{3}$$

Keeping only the clock-feedthrough term, it can be seen that the error output current
is given simply by

$$i_{cf} = \beta V_{ds} V_{cf} \tag{4}$$
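
As a rough numerical check of Eqns. 3 and 4, the sketch below evaluates the simplified nonsaturated drain current with and without a constant clock-feedthrough voltage while the gate-source voltage swings sinusoidally. The device values (`BETA`, `V_T`, `V_DS`, `V_CF`) and the assumption of a fixed drain-source voltage are illustrative choices, not values taken from the paper.

```python
import math

def drain_current(v_gs, v_ds, beta, v_t, v_cf=0.0):
    """Simplified nonsaturated drain current of Eqn. 3 (v_cf adds to v_gs)."""
    return beta * v_ds * (v_gs + v_cf - v_t - 0.5 * v_ds)

def feedthrough_error(v_ds, beta, v_cf):
    """Error current contributed by the clock-feedthrough term, Eqn. 4."""
    return beta * v_ds * v_cf

# Assumed, illustrative device values (not from the paper).
BETA = 50e-6   # transconductance parameter, A/V^2
V_T = 0.8      # threshold voltage, V
V_DS = 0.2     # drain-source voltage, held constant, V
V_CF = 0.01    # constant clock-feedthrough voltage, V

# One period of a sinusoidal gate-source voltage.
v_gs_samples = [1.5 + 0.25 * math.sin(2 * math.pi * n / 64) for n in range(64)]

# Difference between the disturbed and the ideal drain current at each sample.
errors = [
    drain_current(v, V_DS, BETA, V_T, V_CF) - drain_current(v, V_DS, BETA, V_T)
    for v in v_gs_samples
]
```

Under these assumptions every entry of `errors` equals `feedthrough_error(V_DS, BETA, V_CF)`, independent of the signal, which is the pure DC offset the text describes.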

The clock-feedthrough voltage contributes a DC offset. If it is constant, no harmonics are
generated. A new adaptive clock scheme, applied to the dynamic mirror, is shown in
Figure 6. When the clock signal is high, the inverter output is low and the gate-source
voltage of switch MC1 is set by the gate-source voltage of transistor M2, regardless of the
voltage at the source of MC1. When the clock signal goes low, the inverter turns on, which
shorts the gate and source of MC1 together. This turns off the transistor and the input
signal is held on the gate of M1. By using this control, the gate-source voltage of the
switch when on is always equal to the constant gate-source voltage of transistor M2. In
order to cancel the constant clock-feedthrough generated by the adaptive clock, a constant
current that is equivalent to the error current needs to be generated. It has been shown
that the integrator circuit presented earlier will perform such a task [3].

The adaptive clock is also useful as a clock swing limiter. Simulations show that for a data holding
transistor (M1 in Figure 6) width to length ratio of 7/5, which gives a transistor area of
$35 \times 10^{-12}$ square meters, the use of an adaptive clock without cancellation reduces total
harmonic distortion by 10 dB. The DC offset is reduced by an order of magnitude. As
the data holding transistor size is increased, the adaptive clock's effect on total harmonic
distortion decreases due to an increase in the data holding capacitance. This limits the
influence of clock swing reduction. However, even for larger transistor sizes, 28/20, the
use of an adaptive clock without cancellation continues to improve the DC offset, and for
all device sizes the error is kept constant. The adaptive clock circuit of Figure 6 was
fabricated through MOSIS in a two micron CMOS p-well process. The dynamic mirror
used the regulated gate current feedback circuit presented earlier as a cascode. Initial
experimental results verify the improvements indicated by simulations. For a large data
holding transistor size of 28/20 the DC offset error was reduced by 40


6 Conclusion 

The  dynamic  current  mirror  is  a useful  circuit  to  reduce  reliance  on  transistor  matching. 
In  order  to  effectively  use  the  dynamic  current  mirror,  consideration  has  to  be  given  to  the 
effects  of  transients  when  switching  currents.  Clock  delays  and  current  feedback  were  used 
to  reduce  distortion  due  to  voltage  spikes  that  occur  during  intervals  of  current  mismatch. 
Clock-feedthrough is a source of distortion that affects all methods of sampled-data signal
processing. For the dynamic mirror an adaptive clock was developed that was shown to
both reduce the charge injected by the switch and make it constant.

Figure 6: Adaptive clock applied to the dynamic mirror.

Abstract - A constant current source has been designed which makes use of
on-chip electrically erasable memory to adjust the magnitude and temperature
coefficient of the output current. The current source includes a voltage
reference based on the difference between enhancement and depletion transistor
threshold voltages. Accuracy is ±3% over the full range of power supply,
process variations, and temperature using eight bits for tuning.

1 Introduction 

The  lack  of  precision  components  in  CMOS  integrated  circuits  has  traditionally  forced 
design  engineers  to  depend  upon  external  components  and  matching  of  on  chip  components 
to  realize  precision  functions.  For  example,  switched  capacitor  filters  [1]  realize  precise 
transfer  functions  only  when  supplied  with  an  accurate  clock  frequency,  which  is  usually 
generated  by  an  external  crystal  oscillator.  The  locations  of  poles  and  zeros  are  relative 
to  the  clock  frequency,  and  are  determined  by  accurate  on  chip  capacitor  ratios.  Switched 
current  [2],  Transconductor  C [3],  and  MOSFET  C [4]  filters  also  depend  on  an  external 
frequency  reference,  and  matching  of  transistors  to  realize  their  transfer  functions.  In 
cases  where  external  components  are  unacceptable,  some  kind  of  tuning  of  the  non-ideal 
components  must  be  accomplished  to  realize  precision  functions.  Laser  trimming  is  one 
method  which  works  well,  but  requires  expensive  equipment,  and  a special  process.  Blowing 
poly-silicon  fuses  is  inexpensive,  but  sometimes  unreliable,  and  some  types  of  tuning  are 
difficult  to  achieve  with  fuse  blowing.  Neither  laser  trimming,  nor  fuse  blowing  is  reversible, 
a distinct  hindrance  if  a tuning  operation  requires  more  than  one  iteration.  If  the  circuit  to 
be  tuned  is  fabricated  in  a process  which  includes  nonvolatile  electrically  erasable  memory, 
floating  gate  transistors  can  be  programmed  to  trim  analog  performance.  Two  different 
methods  may  be  used  to  employ  the  floating  gate  transistor  to  tune  an  analog  circuit.  First, 
an  analog  voltage  can  be  stored  on  the  floating  gate  to  change  the  current  or  resistance 
from  source  to  drain  [5,6].  The  current  or  resistance  will  be  a function  of  temperature,  and 
possibly  power  supply  voltage.  The  second  method  is  to  use  nonvolatile  digital  memory  to 
select  how  much  resistance  or  capacitance  is  connected  to  a node,  or  which  tap  of  a resistor 
will  be  connected  in  a circuit.  The  second  method  has  the  advantage  of  insensitivity  to 
temperature  and  power  supply  voltage,  assuming  the  resistance  of  the  analog  switch  is  low, 
while it has the disadvantage of requiring more circuitry to do the tuning. In this paper,
a circuit is described which uses the latter method to tune the magnitude and temperature
coefficient of a current source.

2 Application

The tunable current source described in this paper is part of a larger circuit which emits
a constant frequency square wave, independent of temperature, processing, and power
supply voltage. The circuit consists of a voltage controlled oscillator (VCO), a frequency
to current converter, and an integrator connected in a feedback loop as shown in figure 1.

Figure 1: Constant frequency circuit.

The  control  voltage  for  the  VCO  is  used  in  other  cells  on  the  chip.  The  frequency  to 
current  converter  is  based  on  a switched  capacitor  network  whose  average  current  is  given 
by  equation  1 [7]: 

$$I_{ave} = f V C \tag{1}$$

where $f$ is frequency, $V$ is the voltage across the switched capacitor network, and $C$ is
the value of the switched capacitor. At a fixed temperature, both $V$ and $C$ are constants
in this circuit, which makes the average current proportional to frequency. The difference
between  the  constant  current  and  the  switched  capacitor  current  is  integrated,  and  used 
to  control  the  VCO.  The  high  gain  of  the  feedback  loop  ensures  that  the  VCO  emits  a 
frequency  which  causes  the  current  in  the  switched  capacitor  network  to  exactly  cancel  the 
constant  current.  Since  the  capacitor  has  a non-zero  temperature  coefficient,  the  constant 
current  source  must  have  a temperature  coefficient  which  cancels  that  of  the  capacitor.  The 
current  source  must  also  compensate  for  variations  in  reference  voltage  and  capacitance 
due  to  processing. 
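
The locking condition described above can be sketched numerically: the loop settles where the switched capacitor current of equation 1 equals the constant current, so the output frequency is the constant current divided by V times C. The component values and the capacitor temperature coefficient below are assumptions chosen only for illustration.

```python
def locked_frequency(i_const, v, c):
    """Frequency at which f * V * C (equation 1) equals the constant current."""
    return i_const / (v * c)

def scaled(value, tc, delta_t):
    """First order temperature model: value * (1 + tc * delta_t)."""
    return value * (1.0 + tc * delta_t)

# Assumed, illustrative values (not from the paper).
I_CONST = 10e-6   # constant current, A
V = 2.0           # voltage across the switched capacitor network, V
C = 5e-12         # switched capacitor, F
TC_C = 30e-6      # capacitor temperature coefficient, 1/degC
DELTA_T = 100.0   # temperature rise, degC

f_nominal = locked_frequency(I_CONST, V, C)

# Capacitor drifts with temperature but the current does not: frequency shifts.
f_hot_uncompensated = locked_frequency(I_CONST, V, scaled(C, TC_C, DELTA_T))

# Current source given the same temperature coefficient as the capacitor.
f_hot_compensated = locked_frequency(
    scaled(I_CONST, TC_C, DELTA_T), V, scaled(C, TC_C, DELTA_T)
)
```

With these numbers the nominal frequency is 1 MHz, and it is preserved at the higher temperature only when the current source carries the same temperature coefficient as the capacitor.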

3 Voltage Reference

In  this  circuit,  a constant  current  will  be  derived  from  a constant  voltage  as  shown  in 
figure  2. 


Figure  2:  Voltage  to  current  converter. 

The  performance  of  the  current  source  will  only  be  as  good  as  the  performance  of 
the  voltage  reference,  so  the  voltage  reference  must  have  good  rejection  of  power  supply 
variations  and  temperature.  Three  possibilities  come  to  mind  to  generate  a reference 
voltage  on  a chip.  A power  supply  voltage  divider  is  the  most  simple  reference  available,  but 
since  the  specification  for  the  current  is  tighter  than  the  variation  of  the  power  supply,  the 
voltage  divider  can  not  be  used  to  generate  the  reference  voltage.  The  bandgap  reference  [8] 
is  probably  the  most  accurate  voltage  reference  which  can  be  built  on  a CMOS  chip,  but  it 
can  not  be  used  on  this  chip  because  substrate  currents,  caused  by  the  bipolar  transistors, 
are unacceptable. A threshold voltage reference [9] is based on the difference between the
threshold voltages of an enhancement and a depletion MOSFET:

$$V_{ref} \approx V_{te}(1 - \alpha_1 T) - V_{td}(1 - \alpha_2 T) \tag{2}$$

where $V_{te}$ is the n-channel enhancement threshold voltage, $V_{td}$ is the n-channel depletion
threshold voltage, $\alpha_1$ is the temperature coefficient of $V_{te}$, $\alpha_2$ is the temperature coefficient
of $V_{td}$, and $T$ is temperature. Since $\alpha_1$ and $\alpha_2$ are approximately equal, $V_{ref}$ has a very
small temperature coefficient.

The  mobility  temperature  coefficient  can  be  ignored  by  making  the  width  to  length 
ratio  of  the  transistors  large  for  the  amount  of  current  flowing  through  them,  and  by 
making  the  width  to  length  ratio,  and  drain  to  source  current  equal  for  both  transistors. 
This  makes  the  gate  to  source  voltage  mostly  Vt,  and  the  gate  to  source  voltage  above 
Vt  is  approximately  equal  for  both  transistors.  The  threshold  voltage  of  both  transistors 
is  also  dependent  on  the  source  to  bulk  voltage.  Depletion  and  enhancement  transistors 
have  approximately  the  same  body  effect  factor,  so  if  the  transistors  have  the  same  source 
to  bulk  voltage,  the  body  effect  will  change  both  threshold  voltages  equally.  This  gets 
canceled  by  the  subtraction  as  shown  in  equation  2.  One  circuit  which  implements  the 
reference with equal current, and source to bulk voltage in both transistors is shown in
figure 3.


Unfortunately,  the  reference  voltage  is  larger  than  the  power  supply  in  some  cases, 
rendering  the  circuit  useless  for  this  application.  To  rectify  this  situation,  the  circuit  in 
figure 4 was designed which sums half of the two threshold voltages. The body effect is
no longer equal for the two transistors, so the output voltage will be sensitive to the bulk
voltage.  Equal  current  flows  through  the  two  transistors,  and  the  width  to  length  ratios 
are  large  compared  to  the  current,  so  mobility  temperature  coefficients  are  negligible. 


4 Voltage  to  Current  Conversion 

Voltage to current conversion will be accomplished by maintaining the reference voltage
across a resistor using an opamp as shown in figure 2. This circuit will be independent of
temperature only if the resistor and the voltage reference have a zero temperature coefficient.
This is far from true on silicon, where the best resistor available has a temperature
coefficient of approximately 0.1 %/°C. In addition to the temperature coefficient, the values
of the resistor and voltage reference are dependent on processing. To tune the magnitude
of the resistor to account for processing, taps can be programmed as shown in figure 5.
The resistance of the switches must be made small compared to the linear resistor.

Figure 5: Tuning Scheme for the Magnitude of $I_{out}$.

In order to produce a current with a small temperature coefficient, the voltage across
the resistor must have a positive temperature coefficient to cancel the positive temperature
coefficient of the resistor. One way to generate a voltage with a positive temperature
coefficient is to subtract a voltage with a negative temperature coefficient from a constant
voltage. The threshold voltage of an enhancement transistor has a linear negative
temperature coefficient suitable for subtraction from $V_{ref}$. To make the threshold voltage
independent of power supply voltages, a p-channel transistor can be used with its source
connected to the bulk to get rid of the body effect. The circuit in figure 6 shows this
temperature coefficient cancellation.

Figure 6: Temperature Coefficient Cancellation for $I_{out}$.

To tune the temperature coefficient of the current to zero, the magnitude of the reference
voltage can be adjusted. This makes the negative temperature coefficient voltage a
larger or smaller portion of the reference voltage, which adjusts the overall temperature
coefficient. This temperature coefficient cancellation can be expressed as follows:

$$I_{out} = \frac{K_1 V_{ref} - V_{tp}(1 - \alpha_1 T)}{K_2 R_{ref}(1 + \alpha_2 T)} \tag{3}$$

where $V_{tp}$ is the threshold voltage of a p-channel enhancement transistor, $\alpha_1$ is the
temperature coefficient of $V_{tp}$, and $\alpha_2$ is the temperature coefficient of $R_{ref}$. $K_1$ is the fraction of
the reference voltage chosen by the first tapped resistor, and $K_2$ is the portion of $R_{ref}$ chosen
by the second tapped resistor. $K_1$ and $K_2$ range from zero to one. If we define two new
terms:

$$V_{tune} = K_1 V_{ref} - V_{tp} \tag{4}$$

and

$$\beta = \frac{V_{tp}\,\alpha_1}{V_{tune}} \tag{5}$$

then equation 3 can be rewritten as:

$$I_{out} = \frac{V_{tune}(1 + \beta T)}{K_2 R_{ref}(1 + \alpha_2 T)} \tag{6}$$

When $\beta$ equals $\alpha_2$, $I_{out}$ has a zero temperature coefficient.
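
The zero temperature coefficient condition can be turned into a small calculation. The sketch below solves equations 4 to 6 for the tap fractions, assuming that T is measured relative to a reference temperature, that the tap fractions are continuous, and that beta in equation 5 is the positive ratio of the threshold drift to the tuning voltage. All component values are hypothetical.

```python
def i_out(k1, k2, t, v_ref, v_tp, a1, a2, r_ref):
    """Output current of equation 3; t is temperature relative to a reference."""
    return (k1 * v_ref - v_tp * (1.0 - a1 * t)) / (k2 * r_ref * (1.0 + a2 * t))

def zero_tc_settings(i_target, v_ref, v_tp, a1, a2, r_ref):
    """Tap fractions (k1, k2) giving beta == a2 and the target current."""
    v_tune = v_tp * a1 / a2            # equation 5 with beta set equal to a2
    k1 = (v_tune + v_tp) / v_ref       # equation 4 solved for k1
    k2 = v_tune / (i_target * r_ref)   # equation 6 with the T terms cancelled
    return k1, k2

# Assumed, illustrative values (not from the paper).
PARAMS = dict(v_ref=3.0, v_tp=0.8, a1=2e-3, a2=1e-3, r_ref=40e3)
I_TARGET = 50e-6

K1, K2 = zero_tc_settings(I_TARGET, **PARAMS)
currents = [i_out(K1, K2, t, **PARAMS) for t in (-50.0, 0.0, 100.0)]
```

For these assumed values both fractions come out at 0.8, inside the allowed zero-to-one range, and the computed current stays at the target across the three temperatures. A real design would also have to round each fraction to the nearest available tap.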

The  key  to  all  this  temperature  coefficient  cancellation  is  that  all  the  components  have 
only first order temperature coefficients. Measurements from silicon indicate that poly-silicon
resistors have linear temperature coefficients, as well as the smallest temperature
coefficient  of  any  resistor  available  on  chip.  The  poly-silicon  resistor  also  has  no  voltage 
coefficient,  since  there  is  no  reverse  bias  junction  which  could  change  the  dimension  of 
the  resistor.  The  threshold  voltages  of  n-channel  and  p-channel  transistors  have  negligible 
higher  order  temperature  coefficient  terms.  Threshold  voltages  of  n-channel  enhancement 
and  depletion  transistors  were  found  to  track  well  over  temperature.  A simplified  schematic 
of  the  entire  circuit  is  presented  in  figure  7. 


5 Tuning  Strategy 

Equation 3 shows that $K_1$ adjusts the magnitude of the current as well as the temperature
coefficient, so $K_1$ must be adjusted first, then $K_2$ can calibrate the current to the desired
value. Tuning the temperature coefficient of this circuit in a production test environment
without non-volatile memory would be a logistical nightmare. Somehow the result of the
first temperature measurement would have to be stored with a serial number for each die
for use during the second temperature test. No such serial numbers are available for each
die, and moreover, testing at two temperatures is usually done before and after packaging:
one temperature during wafer sort, and the other temperature during final assembly test.
Fortunately, with non-volatile memory the results of the first temperature measurement
can be written to memory and then simply read during the second temperature test. After
the temperature coefficient is tuned, the magnitude can be adjusted in one step, since the
magnitude adjust ($K_2$) should have almost no effect on the overall temperature coefficient.
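
One possible reading of this two-temperature procedure is sketched below. It is a hypothetical test flow, not the authors' production routine: the die is modeled with equation 3 using made-up component values, each tap is assumed to have a 4-bit code (an assumed split of the eight tuning bits), and a dictionary stands in for the on-chip non-volatile memory.

```python
def i_out(k1, k2, t, v_ref=3.0, v_tp=0.8, a1=2e-3, a2=1e-3, r_ref=40e3):
    """Equation 3 with assumed component values; t is relative temperature."""
    return (k1 * v_ref - v_tp * (1.0 - a1 * t)) / (k2 * r_ref * (1.0 + a2 * t))

CODES = range(1, 16)  # hypothetical 4-bit tap codes; fraction = code / 15

def tune(i_target, t1=0.0, t2=100.0):
    """Return (k1_code, k2_code) chosen from measurements at two temperatures."""
    k2_trial = 1.0
    # First temperature (e.g. wafer sort): store one reading per K1 code.
    nvm = {code: i_out(code / 15, k2_trial, t1) for code in CODES}
    # Second temperature (e.g. final test): read the stored values back and
    # keep the K1 code whose current changed least between the two tests.
    usable = [code for code in CODES if nvm[code] > 0.0]
    k1_code = min(
        usable,
        key=lambda code: abs(i_out(code / 15, k2_trial, t2) - nvm[code]) / nvm[code],
    )
    # Magnitude: with K1 fixed, pick the K2 code closest to the target current.
    k2_code = min(
        CODES,
        key=lambda code: abs(i_out(k1_code / 15, code / 15, t2) - i_target),
    )
    return k1_code, k2_code
```

The order matters for the reason given in the text: changing the first tap moves both the temperature coefficient and the magnitude, so the magnitude tap is set last.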

6 Results 

The  circuit  was  simulated  with  various  extremes  of  power  supply  voltage,  processing, 
mismatch  of  threshold  voltage  temperature  coefficients,  and  temperature.  HSpice  [10] 
simulations  show  the  error  to  be  less  than  3%.  Error  can  be  attributed  to  non-zero  step 
size  in  tuning,  and  finite  power  supply  rejection,  especially  to  the  substrate  power  supply. 
The  chip  is  presently  in  layout. 

7 Conclusion

Tuning  analog  circuits  with  nonvolatile  memory  provides  a very  powerful  and  linear  way 
to  overcome  the  wide  tolerances  intrinsic  in  semiconductor  processing.  The  disadvantage 
of  using  digital  memory  to  tune  analog  circuits  as  opposed  to  using  the  floating  gate 
transistors  in  an  analog  fashion  is  the  increased  number  of  transistors  necessary  to  do  the 
tuning.  The  advantage  is  that  the  nonlinear  characteristics  of  programming  the  floating 
gate  transistors  can  be  ignored. 

DC and Small-Signal Physical Models for the
AlGaAs/GaAs  High  Electron  Mobility  Transistor 

J.  C.  Sarker  and  J.  E.  Purviance 

NASA  Space  Engineering  Research  Center  for  VLSI  System  Design 
Department  of  Electrical  Engineering 
University  of  Idaho,  Moscow,  ID  83843 

Abstract - Analytical and numerical models are developed for the microwave
small-signal performance, such as transconductance, gate-to-source capacitance,
current gain cut-off frequency and the optimum cut-off frequency of the
AlGaAs/GaAs High Electron Mobility Transistor (HEMT), in both normal
and compressed transconductance regions. The validated I-V characteristics
and the small-signal performances of four HEMTs are presented.

Nomenclature 

$L$ : Gate length.
$Z$ : Gate width.
$\mu_1$ : Low field mobility of AlGaAs layer.
$\mu_2$ : Low field mobility of two-dimensional electron gas.
$d$ : Thickness of AlGaAs layer.
$d_i$ : Thickness of undoped AlGaAs layer.
$w$ : Width of undepleted region in AlGaAs layer.
$N_d$ : Doping concentration of AlGaAs layer.
$n_s$ : Sheet concentration of two-dimensional electron gas.
$n_{s0}$ : Equilibrium sheet concentration of two-dimensional electron gas.
$\epsilon_2$ : Permittivity of AlGaAs.
$E_{c1}$ : Saturation electric field of AlGaAs.
$E_{c2}$ : Saturation electric field of two-dimensional electron gas.
$v_s$ : Saturation velocity of two-dimensional electron gas.
$\beta$ : Charge control coefficient.
$\delta$ : Effective width of conduction channel.
$V_{tho}$ : Threshold voltage for two-dimensional electron gas.
$V_{bi}$ : Built-in voltage of Schottky gate on AlGaAs layer.
$V_p$ : Effective pinch-off voltage of AlGaAs layer.

1 Introduction 

High frequency solid state technology has been moving towards the use of high electron
mobility transistors in microwave and high speed digital circuits because of their high
frequency operation and their tolerance to many forms of radiation. Several workers
have been studying GaAs HEMTs both theoretically and experimentally since their first
introduction in 1980 [1]. Over the past years, analytical, numerical and/or computer-aided
models have been reported by many authors. But, because of the complexity in the structure
of this device, VLSI circuit designers demand a more accurate and compact model for their
design.

Figure 1: Schematic Diagram of a Uniformly Doped AlGaAs/GaAs HEMT.

Among other workers, C.Z. Cil and S. Tansal [2], in 1985, proposed an analytical model
which used the simple Trofimenkoff-type velocity-field linear relation [3]. Their modeled
results agree very well with the experimental data. However, their model is good only for
the linear normal transconductance region; it covers neither the current saturation region
nor the parasitic conduction in the AlGaAs layer. The computer-aided design
and simulation of HEMT circuits, however, demand a complete and more accurate model. In
1986, G.W. Wang and W.H. Ku [4] developed a compact but complete analytical model
which covers the whole operation range of the dc characteristics. This model calculates
the I-V characteristics of four different HEMTs and compares the modeled results with the
experimental data. We have chosen their model as the basis for this work and from this
model we have developed analytical and numerical models to calculate the small-signal
performances, such as transconductance, $g_m$, gate-to-source capacitance, $C_{gs}$, current gain
cut-off frequency, $f_T$, and the optimum value of the cut-off frequency, $f_{T(opt)}$, before current
saturation occurs.


2 DC  Model 

The  basic  structure  of  a HEMT  device  is  significantly  different  from  a conventional  field 
effect  transistor.  A cross  sectional  view  of  a uniformly  doped  AlGaAs/GaAs  HEMT  device 
is shown in Figure 1.

At  low  gate  voltage,  it  has  only  one  current  conduction  channel  but  at  high  gate  voltage, 
it  has  two  conduction  channels:  one  is  the  two-dimensional  electron  gas  (2-DEG)  in  the 
interface  between  AlGaAs  and  GaAs  and  the  other  is  the  parasitic  conduction  through  the 
undepleted  n+-AlGaAs  layer.  If  the  AlGaAs  layer  is  not  fully  depleted  by  the  Schottky 
gate  and  the  heterojunction,  then  the  free  carriers  under  the  gate  are  the  two-dimensional 
electrons  and  the  free  electrons  in  the  AlGaAs  layer.  The  width  of  the  undepleted  AlGaAs 
region can be approximated by [5]

(1) 

By setting $w = 0$, i.e. when the AlGaAs layer is completely depleted, one can obtain the critical
value of the gate voltage, $V_c$, as

Uc  = VG( w = 0)  = V,,,  - ^(d  -di-  ^)2  (2) 

The condition $V_G \le V_c$ defines the normal transconductance region, where only the 2-DEG is
the current conduction channel, and $V_G > V_c$ defines the compressed transconductance
region, where both the 2-DEG and the undepleted AlGaAs layer are the current conduction
channels.

According  to  the  charge  control  model  [6],  the  sheet  charge  density  of  the  2-DEG  can 
be  approximated  as  a linear  function  of  gate  voltage  and  channel  voltage  which  is  given  by 

$$n_s(x) = \beta\,(V_G - V(x) - V_{th}) \tag{3}$$


where  x is  in  the  direction  along  the  heterojunction. 

In this dc model, for mathematical simplicity, the Trofimenkoff-type [3] electron velocity-field
relation has been used for both the 2-DEG channel and the AlGaAs parasitic conduction
channel. The electron velocity-field relation can be written as

$$v(x) = \frac{\mu E(x)}{1 + \dfrac{E(x)}{E_c}} \tag{4}$$

Here, $E(x)$ is the electric field in the 2-DEG channel or in the undepleted AlGaAs layer
and $E_c$ is the field at which the velocity of electrons reaches the maximum value (saturation
velocity).
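
A short sketch of the velocity-field relation of equation (4) is given here to show its limiting behavior. The mobility and saturation field are assumed, order-of-magnitude values in SI units, not parameters of the devices studied in the paper.

```python
def drift_velocity(e_field, mobility, e_c):
    """Trofimenkoff-type velocity-field relation of equation (4)."""
    return mobility * e_field / (1.0 + e_field / e_c)

# Assumed, illustrative values (not from the paper).
MU = 0.5      # low field mobility, m^2/(V s)
E_C = 3.5e5   # saturation field E_c, V/m

v_low = drift_velocity(1.0, MU, E_C)      # low field: close to MU * E
v_at_ec = drift_velocity(E_C, MU, E_C)    # exactly MU * E_C / 2
v_high = drift_velocity(1e12, MU, E_C)    # approaches MU * E_C from below
```

At low field the relation reduces to the ohmic form, at the saturation field it gives half of the asymptotic velocity, and the asymptote itself is only reached as the field grows without bound.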

Using  the  charge  control  concept  and  the  velocity-field  relationship  described  above, 
the  current  conducting  through  the  2-DEG  channel  can  be  determined  by 


$$I_{2\text{-}DEG} = Z q\, n_s(x)\, v(x) \tag{5}$$

Similarly, the current through the undepleted AlGaAs layer can be determined by

$$I_{AlGaAs} = Z q N_d\, w(x)\, v(x) \tag{6}$$


Here,  for  simplicity,  full  ionization  of  the  donor  atoms  has  been  assumed  for  the  current 
through  the  AlGaAs  layer. 


(A)  I-V  Equations  in  the  Normal  Transconductance  Region 

When the gate voltage is low, i.e. $V_G < V_c$, the normal transconductance region is formed.
This region is then divided into the linear ($V_D < V_{sat}$) and the saturation ($V_D > V_{sat}$)
regions. The current-voltage relationship in the two regions can be derived as follows:

Figure 2: Schematic Diagram Showing Current Saturation in the 2-DEG Channel.


(i) Normal Linear Region ($V_D < V_{sat}$)

Introducing equations (3) and (4) in equation (5), and integrating from source to drain
along the 2-DEG channel, the current through the channel is

$$I_D = \frac{A\left(V_G - V_{th} - \dfrac{V_D}{2}\right)V_D}{1 + \dfrac{V_D}{B}} \tag{7}$$

where $A = Z q \beta \mu_2 / L$ and $B = L E_{c2}$ are the model parameters; $V_G$ and $V_D$ are the internal
gate and drain voltages.
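
The linear-region current of equation (7) can be evaluated with a few lines of code. In this sketch the parameter A is taken as Zqβμ2/L, which is what carrying out the integration of equations (3) to (5) gives and is treated here as an assumption; the device dimensions and material values are illustrative only.

```python
Q = 1.602e-19  # electron charge, C

def model_parameters(z, length, beta, mu2, e_c2):
    """Model parameters A (assumed Z*q*beta*mu2/L) and B = L*E_c2."""
    a = z * Q * beta * mu2 / length
    b = length * e_c2
    return a, b

def linear_drain_current(v_g, v_d, v_th, a, b):
    """Drain current in the normal linear region, equation (7)."""
    return a * (v_g - v_th - 0.5 * v_d) * v_d / (1.0 + v_d / b)

# Assumed, illustrative device values in SI units (not from the paper).
A, B = model_parameters(
    z=100e-6,      # gate width, m
    length=1e-6,   # gate length, m
    beta=2e16,     # charge control coefficient, 1/(m^2 V)
    mu2=0.5,       # 2-DEG low field mobility, m^2/(V s)
    e_c2=3.5e5,    # 2-DEG saturation field, V/m
)

i_d = linear_drain_current(v_g=0.0, v_d=0.1, v_th=-0.5, a=A, b=B)
```

For these numbers the current is a few milliamperes. The denominator term with B is what bends the characteristic below the constant-mobility result as the drain voltage rises.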


(ii) Normal Saturation Region ($V_D > V_{sat}$)

The velocity-field relation (equation (4)) allows the velocity to saturate only when the electric
field approaches infinity, which is physically impossible; so the model assumes that
velocity saturation occurs when $E \ge E_c$. Therefore, from equation (4), the saturation
velocity at $E = E_c$ is $v_s = \mu_2 E_{c2}/2$.

When the drain voltage, $V_D$, becomes greater than the saturation voltage, $V_{sat}$, the
situation becomes like Figure 2. At $x = L_c$, the electric field exceeds the saturation field, $E_{c2}$,
and the electron velocity saturates; beyond this point the electrons move with this constant
saturation velocity. Then, using $V = V_{sat}$ and $E = E_{c2}$ at $x = L_c$, equation (7) can be
written as

r _ Zq/3^(Va-  V,k-S^)V„, 

<8> 

Also  from  equations  (3)  and  (5),  the  current  in  the  saturation  region  can  be  written  as 

ID  = Zq(3(va  - Vt/l  - F,atK  = - Vth  ~ V,at)Ec  2 (9) 

Now,  using  the  current  continuity  condition,  equations  (8)  and  (9)  can  be  combined  to 
obtain 

(i  - Kt)b(Vg  - vA) 

4at  (i  - KJB  + (VG  - Vtn) 


(10) 
where $K_1 = (L - L_c)/L$. Generally, $L_c$ and $V_{sat}$ can be determined by solving a two-dimensional
Poisson's equation, which in the velocity saturation region has the form

$$\frac{d^2V}{dx^2} = \alpha I_D \tag{11}$$

where $\alpha$ is a constant that depends on $S$, the effective width of the conduction channel; $S$ is assumed
to be invariant to the bias voltage as compared to $I_D$ and is set to a constant. This Poisson's
equation is obtained by neglecting the variation of carrier concentration in the direction
perpendicular to the channel and can be solved with the boundary conditions $V(x = L_c) = V_{sat}$
and $E(x = L_c) = E_{c2}$. The final form of the solution becomes

$$V_D - V_{sat} = \frac{\alpha I_D (L - L_c)^2}{2} + E_{c2}(L - L_c) = C I_D K_1^2 + B K_1 \tag{12}$$

where $C = \alpha L^2/2$ is the third model parameter. Equations (10) and (12) can be
solved simultaneously to find $K_1$ and $V_{sat}$:


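$$K_1 = \frac{-X + \sqrt{X^2 + \left[2CA(V_G - V_{th})^2 - 4B\right]\left[\left(1 + \dfrac{V_G - V_{th}}{B}\right)V_D - (V_G - V_{th})\right]}}{CA(V_G - V_{th})^2 - 2B} \tag{13}$$

where $X = B + V_D$. Then from equation (8), the saturation current equation can
be written as

$$I_D = \frac{A\left(V_G - V_{th} - \dfrac{V_{sat}}{2}\right)V_{sat}}{1 - K_1 + \dfrac{V_{sat}}{B}} \tag{14}$$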
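One way to check the saturation-region equations numerically is to solve equations (10), (12) and (14) together for $K_1$. The sketch below uses bisection on the residual of equation (12) rather than a closed-form root; that substitution, the function names, and the bias point are assumptions made for illustration. Parameters are the HEMT #2 entries of Table 1 ($A$ in mA/V², $C$ in kΩ, so $C I_D$ is in volts).

```python
def vsat_eq10(k1, vg_eff, B):
    """Equation (10): saturation voltage for a given K1 (vg_eff = VG - Vth)."""
    u = (1.0 - k1) * B
    return u * vg_eff / (u + vg_eff)


def id_eq14(k1, vsat, vg_eff, A, B):
    """Equation (14): saturation drain current (mA when A is in mA/V^2)."""
    return A * (vg_eff - vsat / 2.0) * vsat / (1.0 - k1 + vsat / B)


def residual_eq12(k1, vd, vg_eff, A, B, C):
    """Left side minus right side of equation (12); zero at the solution."""
    vsat = vsat_eq10(k1, vg_eff, B)
    return vd - vsat - C * id_eq14(k1, vsat, vg_eff, A, B) * k1 ** 2 - B * k1


def solve_saturation(vg, vd, A, B, C, vth, tol=1e-12):
    """Bisection on K1 in (0, 1); returns (K1, Vsat, ID)."""
    vg_eff = vg - vth
    lo, hi = 0.0, 1.0 - 1e-9
    if residual_eq12(lo, vd, vg_eff, A, B, C) <= 0.0:
        raise ValueError("drain voltage is below the saturation voltage")
    if residual_eq12(hi, vd, vg_eff, A, B, C) >= 0.0:
        raise ValueError("no root for K1 in (0, 1)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual_eq12(mid, vd, vg_eff, A, B, C) > 0.0:
            lo = mid
        else:
            hi = mid
    k1 = 0.5 * (lo + hi)
    vsat = vsat_eq10(k1, vg_eff, B)
    return k1, vsat, id_eq14(k1, vsat, vg_eff, A, B)


# HEMT #2 parameters from Table 1; internal bias point chosen for illustration.
A, B, C, VTH = 101.253, 1.604, 0.583, -0.901
VG, VD = -0.2, 1.0
k1, vsat, i_d = solve_saturation(VG, VD, A, B, C, VTH)
```

At the solution, the current from equation (14) should equal the current-continuity form $(AB/2)(V_G - V_{th} - V_{sat})$ that follows from equation (9), which gives a quick consistency check on the solver.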


(B) I-V Equations in the Compressed Transconductance Region

When the gate voltage is high enough that $V_G > V_c$, the AlGaAs layer starts to conduct
current. This conduction mechanism can be considered similar to a parasitic
MESFET and is shown in Figure 3.

When $w = 0$ at $x = L_1$, from equation (1) the voltage inside the channel is $V = V_o =
V_p - V_{bi} + V_G$, where $V_p$ is defined as


Fp  = ^~(d  -di-  ^)2 
P 2e2  ' N/ 


(15) 


When $V_D < V_o$, the sheet carrier concentration, $n_s$, of the whole 2-DEG channel is
equal to its equilibrium value, $n_{so}$, and it is independent of gate and drain voltage. The
2-DEG channel is then like a non-linear resistor with sheet concentration $n_{so}$, while the
undepleted AlGaAs behaves like a MESFET. This equilibrium concentration is assumed
to be the maximum and is given by (from equation (3))

$$n_{so} = \beta\,(V_{bi} - V_p - V_{th}) \tag{16}$$

From the schematic diagram shown in Figure 3, the compressed transconductance
region can be divided into three different regions of operation:

(i) Linear Region I: $V_D < V_o$




Figure  3:  Schematic  Diagram  Showing  Current  Saturation  in  the  2-DEG  Channel  and  the 
Parasitic  Conduction  through  the  AlGaAs  Layer. 


(ii) Linear Region II: $V_o < V_D < V_{sat}$

(iii) Saturation Region: $V_D > V_{sat}$

Here, the assumption $V_{sat} > V_o$ has been made to allow division into the various regions of
operation. This assumption is true for typical HEMT devices.

(i) Linear Region I

For $V_D < V_o$, the current through the AlGaAs layer is derived as in the case of the MESFET
and is given by


A = 


E 


1 + 


,,  ZiVK-Vc  + VDYI'-lyK-Varl' 

VD  


(17) 


where  E ==  - ^ Ni(d  — and  F — LEC x are  two  more  model  parameters: 

The current through the 2-DEG channel becomes

$$I_2 = \frac{A\,(V_{bi} - V_p - V_{th})\,V_D}{1 + \dfrac{V_D}{B}} \tag{18}$$

The total current in this region of operation is the sum of these two currents: $I_D = I_1 + I_2$.


(ii) Linear Region II

From Figure 3, for $V_D > V_o$, the current flowing through the 2-DEG channel is

$$I_2 = \frac{Zq\mu_2 n_{so} V_o}{L_1 + \dfrac{V_o}{E_{c2}}} \tag{19}$$

From this equation

$$L_1 = \frac{Zq\mu_2 n_{so} V_o}{I_2} - \frac{V_o}{E_{c2}} \tag{20}$$

Current through the AlGaAs layer can be obtained from equation (6) as

$$I_1 = \frac{Zq\mu_1 N_d\, w(x)\, \dfrac{dV}{dx}}{1 + \dfrac{1}{E_{c1}}\dfrac{dV}{dx}}$$

Integrating this equation for $V$ from 0 to $V_o$ and for $x$ from 0 to $L_1$, and then using equation
(20) for $L_1$, the final expression for current becomes


h = 


~-Mvh-ys-vi^  _i  , llv 
J2  fl  ' F J ¥° 


(21) 


The derivation of the current expression in the 2-DEG is similar to that for the normal region.
Here, however, the limits of integration for $V$ are from $V_o$ to $V_D$ and for $x$ from $L_1$ to $L$. After
performing the integration and using equation (20) for $L_1$, the current through the 2-DEG
channel can be obtained as

$$I_2 = \frac{A\left[\left(V_G - V_{th} - \dfrac{V_D + V_o}{2}\right)(V_D - V_o) + (V_{bi} - V_p - V_{th})V_o\right]}{1 + \dfrac{V_D}{B}} \tag{22}$$

So, the total drain current is the sum of equations (21) and (22).

(iii) Saturation Region

For the saturation region, $V_D > V_{sat}$, the current expression for the undepleted AlGaAs
layer is the same as in linear region II (equation (21)). The principle used to find the saturation
voltage in this operating region is similar to that in the normal region, except that the
contribution from the parasitic conduction has to be taken into account. From the current
continuity at the interface of the velocity saturation region and the non-saturation region
(Figure 3), the saturation voltage can be obtained as

$$V_{sat} = \frac{(1 - K_1)B\,(V_G - V_{th}) + 2(V_G - V_{bi} + V_p)V_o - V_o^2}{(V_G - V_{th}) + (1 - K_1)B} \tag{23}$$

On the other hand, the solution of the Poisson's equation (equation (12)) in this region
becomes

$$V_D - V_{sat} = \frac{CAK_1^2\left[(V_G - V_{th})V_{sat} - \dfrac{V_{sat}^2}{2} - (V_G - V_{bi} + V_p)V_o + \dfrac{V_o^2}{2}\right]}{1 - K_1 + \dfrac{V_{sat}}{B}} + BK_1 \tag{24}$$

By solving equations (23) and (24) iteratively, values of $K_1$ and $V_{sat}$ can be found. Once
$K_1$ and $V_{sat}$ are found, the current through the 2-DEG channel can be obtained as

$$I_2 = \frac{A\left[\left(V_G - V_{th} - \dfrac{V_{sat} + V_o}{2}\right)(V_{sat} - V_o) + (V_{bi} - V_p - V_{th})V_o\right]}{1 - K_1 + \dfrac{V_{sat}}{B}} \tag{25}$$

The total drain current is then the sum of equations (21) and (25).

In the subthreshold region of operation the charge control is not linear; so, in addition to
the model and physical parameters, a fitting parameter, $D$, is used to model the threshold
voltage shift of the 2-DEG caused by the drain voltage. This simple threshold voltage
correction is given by

$$V_{th} = V_{th0} - D \times V_D \tag{26}$$

So, with the nine parameters $A$, $B$, $C$, $E$, $F$, $V_p$, $V_{bi}$, $V_{th0}$, and $D$, the I-V characteristics
of the AlGaAs/GaAs HEMT device can be modeled completely.




3 Small-Signal  Model 

Evaluation and analysis of the small-signal performance of the HEMT are important for
the operation of microwave circuits. The HEMT is usually biased in the normal transconductance
region without parasitic conduction for optimal low-noise and/or high-frequency
performance. Some of the small-signal parameters, such as transconductance, gate-to-source
capacitance and current gain cut-off frequency, can be derived analytically from this
model. The derivation of these parameters in the saturated normal region and in the
compressed transconductance region is mathematically complicated and computationally
expensive. So, to determine these parameters in those regions of operation
a computationally efficient numerical technique has been used. Methods of determining
these small-signal parameters are discussed in the next few subsections.


3.1 Transconductance, g_m

The intrinsic transconductance, $g_m$, at constant drain voltage is defined as

$$g_m = \left.\frac{\partial I_D}{\partial V_G}\right|_{V_D = \text{constant}}$$

The $g_m$ in the linear normal region can be obtained analytically by differentiating the drain
current (equation (7)) with respect to gate voltage:

$$g_m = \frac{\partial}{\partial V_G}\left[\frac{A\left(V_G - V_{th} - \dfrac{V_D}{2}\right)V_D}{1 + \dfrac{V_D}{B}}\right] = \frac{AV_D}{1 + \dfrac{V_D}{B}} \tag{27}$$

The transconductance increases with drain voltage before current saturation and is
inversely proportional to gate length and to the mobility degradation factor $(1 + V_D/B)$.

To calculate $g_m$ in the saturation region and in the compressed (both linear and saturation)
region, we differentiate the corresponding drain currents numerically. For this
numerical differentiation we have used the centered finite-divided-difference equation of
the form [7]

$$g_m(V_{G_i}) = \frac{I_D(V_{G_{i+1}}) - I_D(V_{G_{i-1}})}{V_{G_{i+1}} - V_{G_{i-1}}} \tag{28}$$

Here, $g_m(V_{G_i})$ is the transconductance evaluated at the $i$th point.
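The centered difference of equation (28) can be sketched as below. For a check, it is applied to the linear-region current of equation (7), where the analytical result of equation (27) is available for comparison; in the regions without a closed form the same helper would be applied to the corresponding current routine. The HEMT #2 parameters come from Table 1, while the function names, bias point and step size are assumptions.

```python
def drain_current_linear(vg, vd, A, B, vth):
    """Equation (7): linear-region drain current (mA)."""
    return A * (vg - vth - vd / 2.0) * vd / (1.0 + vd / B)


def central_difference(f, x, h):
    """Equation (28) on a uniform grid: (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# HEMT #2 parameters from Table 1; internal bias chosen for illustration.
A, B, VTH = 101.253, 1.604, -0.901
VD = 0.3

gm_numeric = central_difference(
    lambda vg: drain_current_linear(vg, VD, A, B, VTH), -0.2, 0.05
)
gm_analytic = A * VD / (1.0 + VD / B)   # equation (27), in mS
```

Because equation (7) is linear in $V_G$, the centered difference reproduces equation (27) to rounding error here; in the saturation and compressed regions the step size would control the accuracy.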


3.2 Gate-to-Source Capacitance, C_gs

Gate-to-source capacitance, $C_{gs}$, is defined, with the assumption $C_{gd} \ll C_{gs}$, as

$$C_{gs} = \frac{\partial Q_T}{\partial V_G}$$

where $Q_T$ is the total charge.

In the normal region, the AlGaAs layer is completely depleted, so $C_{gs}$ is due only
to the two-dimensional electron gas. Thus, for the normal region


d 

dVG 


qn,(x)dx 


Substituting  equation  (3)  for  n,(x ) and  then  performing  the  integration,  we  get 


/ q0(VG  - V(x)  - vth)dx 

J o 


AL\2 + \) 


V2 


(29) 


Calculation of the gate-to-source capacitance in the saturation region is more complicated
because of the complexity of the total charge calculation in the channel. The method we have
used to calculate the charge in the channel is given in detail in reference [8]. The final
expression for the total charge, $Q_T$, becomes


Qt  = — 
^2 


(Va  - Vlh)L  - Vm(L  - ^)] 


(30) 


where 

x,— (31) 

In this equation, the saturation current, $I_D$, is calculated by using equation (14) at the
saturation voltage, $V_{sat}$.

Once we know the total charge in the channel we can calculate $C_{gs}$ by numerical
differentiation. The form of this differentiation is analogous to the $g_m$ equation:

$$C_{gs}(V_{G_i}) = \frac{Q_T(V_{G_{i+1}}) - Q_T(V_{G_{i-1}})}{V_{G_{i+1}} - V_{G_{i-1}}} \tag{32}$$


Ideally, to calculate $C_{gs}$ in the compressed region, the capacitance due to the charge
accumulated in the undepleted AlGaAs layer has to be added to the capacitance due to
the 2-DEG channel. But the analytical calculation of the capacitance due to the AlGaAs layer
from this model is not straightforward. Moreover, this additional capacitance
contribution may not be very significant, particularly at high drain voltages. So, in this
work we have neglected this contribution compared to the capacitance due to the 2-DEG
channel charge. Therefore, equation (32) has also been used to calculate the gate-to-source
capacitances in the compressed transconductance region.


3.3 Current Gain Cut-off Frequency, f_T

In microwave applications, the current gain cut-off frequency is used as an
indicator of the device speed. The conventional definition of $f_T$ is

$$f_T = \frac{g_m}{2\pi C_{gs}}$$

In the normal linear region, we calculated $f_T$ analytically by using equations (27) and (29);

(33) 


1 

■ avd  ' 

^2 

2tt 

1 + 

[AP(  2+^jJ 

"I  Vn\(  O i Vn 


We again adopted numerical techniques to calculate $f_T$ for the normal saturated region
and for both the linear and saturated compressed regions. This numerical expression is given by

$$f_T(V_{G_i}) = \frac{g_m(V_{G_i})}{2\pi C_{gs}(V_{G_i})} \tag{34}$$

Here, $f_T$, $g_m$ and $C_{gs}$ are calculated at the $i$th point.
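Equation (34) is a pointwise ratio, so it is easy to sanity-check against tabulated values. The sketch below assumes that the entries 27.471 and 507.09 in the HEMT #2 normal-linear row of Table 2 are $g_m$ in mS and $C_{gs}$ in fF; that column assignment is an assumption, as is the function name.

```python
import math


def cutoff_frequency_hz(gm_siemens, cgs_farads):
    """Equation (34): f_T = g_m / (2 * pi * C_gs), in Hz."""
    return gm_siemens / (2.0 * math.pi * cgs_farads)


# Assumed reading of one Table 2 row: g_m = 27.471 mS, C_gs = 507.09 fF.
ft_ghz = cutoff_frequency_hz(27.471e-3, 507.09e-15) / 1e9
```

Under that reading the result is about 8.62 GHz, which agrees with the 8.622 listed in the same row.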


3.4 Optimum Cut-off Frequency, f_T(opt)

Another important parameter in microwave applications is the optimum frequency, $f_T(opt)$.
This optimum frequency is defined as the maximum value of the current gain cut-off
frequency just before current saturation occurs. Thus, in the normal transconductance region,
$f_T(opt)$ is approximated as

PjF.at  


fr(opt ) = 


2tt^(1  + ^)(2  + Y) 


(35) 


Here, $V_{sat}$, the value of the saturation voltage when the current just starts to saturate, can be
evaluated by setting $K_1 = 0$ in equation (10):

$$V_{sat} = \frac{B\,(V_G - V_{th})}{B + (V_G - V_{th})} \tag{36}$$


4 Results  and  Discussion 

4.1 The I-V Characteristics

To validate the dc model we have developed a computer simulation program which
calculates the I-V characteristics over the entire region of operation. Using this simulation
program we have calculated the I-V characteristics of all four HEMTs. The device
physical parameters and the modeling parameters of these HEMTs, taken from reference
[4], are given in Table 1.

In the derivation of the drain current equations in section 2, the dc model does not
include the effects of parasitic source and drain resistances explicitly. These effects can be
taken into account in the model by solving the nonlinear equations given below:

$$V_{GS} = V_G + I_D(V_G, V_D)\,R_S \tag{37}$$

and

$$V_{DS} = V_D + I_D(V_G, V_D)\,(R_S + R_D) \tag{38}$$
Parameter     HEMT #1 (TRW #2078)   HEMT #2    HEMT #3 (GE #5410)   HEMT #4
L (μm)        0.35                  1.0        0.25                 1.0
Z (μm)        65                    145        100                  1200
V_th0 (V)     -0.017                -0.901     -0.912               -2.389
V_p (V)       -                     -          1.481                2.319
V_bi (V)      -                     -          0.85                 0.85
A (mA/V²)     49.517                101.253    103.539              454.167
B (V)         5.285                 1.604      0.616                0.948
C (kΩ)        8.341                 0.583      0.992                0.201
D             0.015                 0          0.092                0.008
E (mA/V)      -                     -          81.825               542.663
F (V)         -                     -          2.154                3.04
R_S (Ω)       5.9                   7.0        4.6                  1.0
R_D (Ω)       6.0                   7.0        6.0                  1.0

Table 1: Physical and Model Parameters of the HEMTs.


where $V_{GS}$ and $V_{DS}$ are the externally applied gate and drain voltages respectively; $R_S$
and $R_D$ are the parasitic source and drain resistances. These two equations were solved
iteratively in the program to find the values of $V_G$ and $V_D$ for given values of the external
voltages $V_{GS}$ and $V_{DS}$.
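The iterative solution of equations (37) and (38) can be sketched as a damped fixed-point iteration on the drain current. This is only one possible scheme; the damping factor, tolerance, and function names are assumptions, and for brevity the current routine is the linear-region equation (7) alone, so the example bias is chosen to end up in that region. Parameters are the HEMT #2 entries of Table 1.

```python
def drain_current_linear(vg, vd, A, B, vth):
    """Equation (7): linear-region drain current (mA)."""
    return A * (vg - vth - vd / 2.0) * vd / (1.0 + vd / B)


def solve_internal_voltages(vgs, vds, rs, rd, current_fn, tol=1e-10, max_iter=500):
    """Solve equations (37) and (38) for (VG, VD, ID); ID in mA, R in ohms."""
    i_d = 0.0
    for _ in range(max_iter):
        vg = vgs - i_d * 1e-3 * rs            # equation (37) rearranged
        vd = vds - i_d * 1e-3 * (rs + rd)     # equation (38) rearranged
        i_new = current_fn(vg, vd)
        if abs(i_new - i_d) < tol:
            return vg, vd, i_new
        i_d = 0.5 * (i_d + i_new)             # damped update for stability
    raise RuntimeError("fixed-point iteration did not converge")


# HEMT #2 parameters from Table 1.
A, B, VTH = 101.253, 1.604, -0.901
RS, RD = 7.0, 7.0
VGS, VDS = -0.2, 0.5

vg, vd, i_d = solve_internal_voltages(
    VGS, VDS, RS, RD, lambda g, d: drain_current_linear(g, d, A, B, VTH)
)
```

A full simulator would replace the lambda with a routine that switches between the linear, saturation and compressed-region expressions according to the internal voltages.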

HEMT #1 and HEMT #2 show only normal transconductance effects; only five model
parameters, $V_{th0}$, $A$, $B$, $C$ and $D$, are needed in the program to calculate the I-V relation.
With these parameter values and using equations (7), (10), (13), (14), (37) and (38), we
have developed a simulation program which calculates the drain-to-source current as a
function of external drain voltage for different external gate voltages.

Figure 4 shows the I-V curve of HEMT #1. In the program, we have swept the
drain voltage from 0 to 3 volts in 0.2 volt steps and calculated drain-to-source currents
for gate voltages $V_{GS}$ = 0, 0.1, 0.2, 0.3, 0.4 and 0.5 volts. As a comparison, we have also
plotted the experimental data obtained from reference [4]. From the figure, we can see
good agreement between our I-V results and the experimental data.

Simulated results along with experimental data [4,6] for HEMT #2 are shown in
Figure 5. In this case the drain voltage was varied from 0 to 3 volts in 0.25 volt steps.
Drain currents for $V_{GS}$ = -0.8, -0.6, -0.4, -0.2 and 0 volts were calculated. The low gate
bias curves agree very well with the experimental values. As the gate bias increases, a small
deviation occurs near the transition between the linear and saturation regions.

The I-V characteristics of HEMT #3 and HEMT #4 (double heterojunction HEMT) are
more complex because of the compressed transconductance effect (in addition to the normal
transconductance effect). Four additional parameters, $V_p$, $V_{bi}$, $E$ and $F$, are needed to model
this effect. So, with the nine parameter values listed in Table 1 and using equations
(17), (18) and (21)-(25), we have calculated the drain-to-source currents in the compressed
transconductance region. Equation (24) was rearranged such that $K_1$ can be written in
terms of $V_{sat}$:

$$K_1 = \frac{-(B + V_D) + \sqrt{(B + V_D)^2 + 4\left[CAY - B\right]\left[\left(1 + \dfrac{V_{sat}}{B}\right)(V_D - V_{sat})\right]}}{2\left[CAY - B\right]} \tag{39}$$

where $Y = (V_G - V_{th})V_{sat} - \dfrac{V_{sat}^2}{2} - (V_G - V_{bi} + V_p)V_o + \dfrac{V_o^2}{2}$. Equations (23) and (39) were solved iteratively in the
program by assuming an initial value of $K_1 = 0$ to obtain $V_{sat}$ and then $K_1$. After $K_1$ and
$V_{sat}$ are known, the drain current through the 2-DEG channel in the saturation region is
calculated by using equation (25).

Figure 4: I-V Characteristics of HEMT #1
Figure 5: I-V Characteristics of HEMT #2
Figure 6: I-V Characteristics of HEMT #3
Figure 7: I-V Characteristics of HEMT #4

Figure 6 shows the I-V curve of HEMT #3. Here, we have scanned the drain
voltage from 0 to 3 volts in steps of 0.2 volts for the gate voltages $V_{GS}$ = -1.0, -0.8,
-0.6, -0.4, -0.2 and 0 volts. As can be seen, the modeled result agrees very well with the
experimental data [4].

Finally, we have calculated the I-V characteristics of a double heterojunction HEMT
(HEMT #4) and the results, along with the experimental data, are shown in Figure 7.
These results also agree fairly well with the published measured data [4].

Two of the four HEMTs (HEMT #1 and #3) are sub-half-micron gate HEMTs.
Unmodeled short channel effects such as velocity overshoot and unmodeled hot carrier effects
may occur in these two HEMTs. It is reported that these effects start to become prominent
below 0.25 μm gate length [9]; therefore HEMT #3 may show a considerable short channel
effect in the compressed transconductance region. Moreover, this dc model was originally
developed only for the single-heterojunction HEMT. But from our simulation results for
HEMT #4, which is a double-heterojunction HEMT, we found that this model also appears
to work well for the double-heterojunction HEMT.


4.2  Small-Signal  Performance  Calculation 

Based on the equations derived in section 3 and the physical parameters listed in Table 1,
we have developed a simulation program which calculates the small-signal performance.
Using this program we have calculated $I_D$, $g_m$, $C_{gs}$, $f_T$ and $f_T(opt)$ as a function of gate
voltage, keeping the drain voltage fixed. Table 2 shows the small-signal parameter values for
all four HEMTs for different drain and gate bias conditions.

This  small-signal  model  has  been  developed  in  an  academic  environment,  based  on 
a quasi-static  approximation.  The  values  of  the  small-signal  parameters  are  essentially 
theoretical  and  have  not  been  rigorously  validated  in  this  work  because  of  the  unavailability 
of  the  experimental  data. 


5 Conclusion 

A complete analytical dc model for the uniformly doped AlGaAs/GaAs HEMT device
has been extensively analyzed and validated independently. Based on the model, a simulation
program was developed to calculate the I-V characteristics. Using this program, the I-V


Device 

Bias  Condition 

gm{mS) 

h{GHz) 

fT{opt){GHz) 

HEMT  #1 

Normal  Linear  Region 
VGS  = 0.4V,  Vvs  = 0.4V 

3.677 

16.526 

28.501 

92.283 

96.055 

HEMT  #1 

Normal  Saturation  Region 
VGS  = 0.2V,  VDS  = l.OV 

1.376 

10.868 

6.5129 

265.58 

- 

HEMT  #2 

Normal  Linear  Region 
VGS  = -0.2V,  VDS  = 0.5V 

12.389 

27.471 

507.09 

8.622 

10.698 

HEMT  #2 

Normal  Saturation  Region 
VGS  = -0.2V,  VDS  = 1.0V 

15.461 

31.098 

134.89 

36.692 

- 

HEMT  #3 

Normal  Linear  Region 
VGS  = -0.65V,  VDS  = 0.2V 

33.553 

66.432 

69.527 

HEMT  #3 

Normal  Saturation  Region 

VGS  = -0.8V,  VDS  = l.W 

- 

HEMT  #3 

Compressed  Linear  Region 
VGS  = -0.2V  VDS  = 0.4V 

- 

: HEMT  #3 

Compressed  Saturation  Region 
VG5  = -0.4V  VDS  = l.OV 

237.45 

- 

HEMT  #4 

Normal  Linear  Region 
VGS  = -1.5V  VDS  = 0.5V 

116.277 

1583.0 

11.89 

13.18 

HEMT  #4 

Normal  Saturation  Region 
VGS  = -1.5V  VDS  = l.OV 

88.939 

133.306 

42.36 

- 

HEMT  #4 

Compressed  Linear  Region 
Vgs  = -0.5V,  VDS  = l.OV 

6.996 

- 

HEMT  #4 

Compressed  Saturation  Region 

VGS  = -l.OV,  vDS  = l.ov 

153.567 

548.7 

35.77 

Table  2:  Small-Signal  Performances  of  the  AlGaAs/GaAs  HEMTs  Calculated  in  this  Work. 


curves for four HEMTs were successfully calculated and compared with the experimental
data reported earlier [4,6].

In the second phase of the work, analytical and numerical methods were developed
to predict some of the important small-signal performance figures of these HEMTs. Based on
this new computer-aided model, the small-signal parameters $g_m$, $C_{gs}$, $f_T$ and $f_T(opt)$
were calculated and are presented in Table 2. The proposed small-signal model for the
AlGaAs/GaAs HEMT device may be useful for VLSI and microwave applications in the future.
Abstract: The formal specification of a high speed CMOS correlator is presented.
The specification gives the high-level behavior of the correlator and provides
a clear, unambiguous description of the high-level architecture of the device.


1 Introduction. 

The  use  of  formal  specification  in  designing  VLSI  circuits  has  many  benefits.  Perhaps 
the  most  important  result  is  a clear  description  of  the  design’s  behavior  that  can  be  used 
for  communication  among  design  engineers,  production  engineers,  test  engineers,  technical 
writers,  and,  perhaps  most  importantly,  customers.  Formal  specifications  also  provide  a 
firm  foundation  upon  which  analysis  of  the  circuit  design  can  take  place.  This  analysis 
has  the  potential  to  significantly  reduce  design  errors  as  well  as  providing  a basis  for 
demonstrating  that  the  design  has  desired  properties. 

This paper presents the formal specification of a high-speed CMOS correlator [2]. The
correlator, which is designed to be used in a space-borne spectrometer, contains 32 channels
and is capable of sampling at 25 MHz.

2 Formal Specification and Verification.

VLSI devices can be specified at many levels of abstraction [8]. Generally, we need at least
a behavioral and a structural specification [4]. The behavioral specification is written in
logic and unambiguously describes the expected behavior of the device. The behavioral
specification is declarative rather than imperative, giving a clear relationship between the
inputs, current state, and outputs.

The structural specification describes, again using logic, how the circuit is put together.
Ideally, the structural specification can be derived from design information captured
by conventional CAD tools or translated from a hardware description language such
as VHDL [6].

Verification is nothing more than a mathematical analysis of the behavioral and structural
models. Ideally, we would like to show that the intended behavior follows from the
structure. This analysis, which is a type of symbolic simulation, can be done by hand or
with the aid of mechanical verification tools [5]. These mathematical models can also be
used to analytically demonstrate selected behavioral properties for a computer system.
3 A Brief Introduction to HOL.

To formally model hardware and to ensure the accuracy of our proofs, we felt that it
was necessary to develop the proofs and properties using a mechanical verification system.
This prevents proofs from containing logical mistakes, and assures that the foundations
on which the work is based are sound. Due to the nature of the proofs, which include
quantification over sets of objects, we felt that a system which supports higher-order logic
and a typed lambda calculus would facilitate our efforts. The HOL system was selected for
this project due to its support for higher-order logic, generic specifications and polymorphic
type constructs. Furthermore, its availability, ruggedness, local support, and growing
world-wide user base made it a very attractive selection. In this section we provide a
brief description of HOL.

HOL is a general theorem proving system developed at the University of Cambridge
[5,1] that is based on Church's theory of simple types, or higher-order logic [3]. Although
Church developed higher-order logic as a foundation for mathematics, it can be used for
reasoning about computational systems of all kinds. Similar to predicate logic in allowing
quantification over variables, higher-order logic also allows quantification over predicates
and functions, thus permitting more general systems to be described.

HOL is not a fully automated theorem prover but is more than simply a proof checker,
falling somewhere between these two extremes. HOL has several features that contribute
to its use as a verification environment:

1.  Several  built-in  theories,  including  booleans,  individuals,  numbers,  products,  sums, 
lists,  and  trees.  These  theories  build  on  the  five  axioms  that  form  the  basis  of  higher- 
order  logic  to  derive  a large  number  of  theorems  that  follow  from  them. 

2.  Rules  of  inference  for  higher-order  logic.  These  rules  contain  not  only  the  eight  basic 
rules  of  inference  from  higher-order  logic,  but  also  a large  body  of  derived  inference 
rules  that  allow  proofs  to  proceed  using  larger  steps.  The  HOL  system  has  rules  that 
implement  the  standard  introduction  and  elimination  rules  for  Predicate  Calculus  as 
well  as  specialized  rules  for  rewriting  terms. 

3. A large collection of tactics to support goal directed proof. Examples of tactics
include REWRITE_TAC which rewrites a goal according to some previously proven
theorem or definition, GEN_TAC which removes unnecessary universally quantified
variables from the front of a goal, and EQ_TAC which says that to show two things are
equivalent, we should show that they imply each other.

4.  A proof  management  system  that  keeps  track  of  the  state  of  an  interactive  proof 
session. 

5.  A metalanguage,  ML,  for  programming  and  extending  the  theorem  prover.  Using 
the  metalanguage,  tactics  can  be  put  together  to  form  more  powerful  tactics,  new 
tactics  can  be  written,  and  theorems  can  be  aggregated  to  form  new  theories  for  later 
use.  The  metalanguage  makes  the  verification  system  extremely  flexible. 
Operator   Application   Meaning
=          t1 = t2       t1 equals t2
,          t1,t2         the pair t1 and t2
∧          t1 ∧ t2       t1 and t2
∨          t1 ∨ t2       t1 or t2
⇒          t1 ⇒ t2       t1 implies t2

Table 1: HOL Infix Operators


Binder   Application   Meaning
∀        ∀ x. t        for all x, t
∃        ∃ x. t        there exists an x such that t
ε        ε x. t        choose an x such that t is true

Table 2: HOL Binders


In the HOL system there are several predefined constants which can belong to two
special syntactic classes. Constants of arity 2 can be declared to be infix. Infix operators
are written "rand1 op rand2" instead of in the usual prefix form: "op rand1 rand2".
Table 1 shows several of HOL's built-in infix operators.

Constants can also belong to another special class called binders. A familiar example of
a binder is ∀. If c is a binder, then the term "c x. t" (where x is a variable) is written as
shorthand for the term "c(λ x. t)". Table 2 shows several of HOL's built-in binders.

In addition to the infix constants and binders, HOL has a conditional statement that
is written a → b | c, meaning "if a, then b, else c."

4 The  Correlator  Design. 

The correlator is designed for a space-borne spectrometer. The design accepts two 2-bit
data streams clocked at a maximum of 25 MHz. Delayed versions of one stream are
multiplied (using a biased multiplication) with the undelayed signal on the other stream.
The products are accumulated. The process continues for the duration of the integration
cycle, which is defined by the int control line. When the end of an integration period
is signaled, the results are latched into a register, the accumulators are cleared, and the
datardy line goes high to signal that data is ready to be read from the chip. A new
integration cycle can begin immediately. Concurrent with the new integration period, the
data from the previous integration period can be read on the output lines. Data is read in
either a word serial or byte serial mode depending on the value of a control line.

Readers  interested  in  additional  detail  are  referred  to  [2]. 


5 The  Correlator  Specification. 

This  section  presents  the  behavioral  specification  of  the  correlator. 
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Figure  1:  Architecture  of  the  correlator  shows  the  producer — 
consumer  relationship  between  the  INT  interpreter  and  the 
10  interpreter. 

The  overall  architecture  of  the  behavioral  description  is  shown  in  Figure  1.  The  archi- 
tecture is  based  on  two  separate  state  machines  which,  along  with  the  datapath,  function 
as  single  instruction  interpreters  [7].  The  interpreters  are  arranged  in  a producer— consumer 
architecture  with  a register  serving  as  the  shared  lint  between  the  two  interpreters. 

The producer portion of the design is the INT interpreter. INT performs the integration
of the incoming signals in 32 channels. The interpreter controls the following state
variables:

• acc - A bank of 32 4-bit accumulators.

• delay - A bank of 32 2-bit delay elements.

• sr - A bank of 32 24-bit shift registers.

• count - A bank of 32 24-bit counters.

Each of these state variables is parameterized for time and channel number and has type
:time->num->w, where w varies with register width.


The specification for INT relates the state variables at time t + 1 to their values at
time t and the value of the inputs at time t.


The function nextstate evaluates to either integrate or dump depending on the value of
the int line.

The  individual  instructions  produce  new  values  for  the  state  variables.  In  the  case 
of  the  integrate  instruction  new  values  are  calculated  for  the  acc,  delay,  and  count 
variables.  The  shift  register  (sr)  is  unchanged. 


⊢def integrate (acc, delay, sr, count, datardy)
               (a, b, int, rn) =
   let signal_product n = mapper (delay n) b in (
   let new_acc n =
       rn → (bt4_ival 0) |
       (add4 (signal_product n, acc n)) in
   let new_delay n = (n=0) → a | (delay (n-1)) in
   let new_count n =
       rn → (wordn 0) |
       (carry4 (signal_product n, acc n)) → inc (count n) |
       (count n) in
   (new_acc, new_delay, sr, new_count, datardy))


The  new  values  are  precisely  described.  For  example,  the  new  value  of  the  nth  accumulator 
is  calculated  by  adding  a biased  multiplication  of  the  n-delayed  signal  and  the  undelayed 
signal  to  the  current  value  in  the  same  accumulator. 
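
The integrate instruction can be mirrored in ordinary code to make the data flow easier to follow. The Python sketch below is illustrative only. The helpers mapper, add4, and carry4 are not defined in this excerpt, so the stand-ins used here (a product reduced modulo 16, 4-bit modular addition, and overflow detection) are assumptions, as are the list-based state representation and the 24-bit counter wraparound. The unused int input is omitted.

```python
CHANNELS = 32            # number of correlator channels (from the text)
ACC_MOD = 1 << 4         # 4-bit accumulators
COUNT_MOD = 1 << 24      # 24-bit counters (wraparound is an assumption)


def mapper(delayed, b):
    """Hypothetical stand-in for the biased multiplier; not the paper's definition."""
    return (delayed * b) % ACC_MOD


def integrate(acc, delay, sr, count, datardy, a, b, rn):
    """Apply one integrate step to every channel and return the new state."""
    sums = [mapper(delay[n], b) + acc[n] for n in range(CHANNELS)]
    if rn:
        # Reset clears the accumulators and the counters.
        new_acc = [0] * CHANNELS
        new_count = [0] * CHANNELS
    else:
        # add4: keep the low four bits of the sum.
        new_acc = [s % ACC_MOD for s in sums]
        # carry4: an overflow of the 4-bit sum increments the channel counter.
        new_count = [
            (count[n] + 1) % COUNT_MOD if sums[n] >= ACC_MOD else count[n]
            for n in range(CHANNELS)
        ]
    # Channel 0 takes the input a; channel n takes the old value of channel n-1.
    new_delay = [a] + delay[:-1]
    # sr and datardy pass through unchanged, as in the specification.
    return new_acc, new_delay, sr, new_count, datardy
```

Under these assumptions, an accumulator holding 15 that receives a product of 1 wraps to 0 and increments its counter, which is the carry path described by new_count.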

The consumer portion of the circuit is the IO interpreter. The interpreter controls the
following state variables:

• sr - A bank of 32 24-bit shift registers. This is the same register as the sr register
in the INT interpreter.

• counter - A 7-bit counter for counting the output.

• out - A 16-bit register that latches the values on the output lines.

• borw - A state variable that indicates whether output is byte or word serial.

The specification for the IO interpreter is similar to the specification of the INT interpreter.
The IO interpreter has six instructions. The interpreter can be reset, it can start
the read cycle, it can end the read cycle, it can dump data from the output registers a
byte at a time, it can dump data a word at a time, or it can do nothing.
⊢def io_int (sr, counter, out, borw, datardy, begin)
            (byte, rn, outck) =
   ∀ t.
   let nextstate =
       ((rn t) → reset |
        ((datardy t) ∧ begin) → start_read |
        ((datardy t) ∧ ((val (counter t)) = 0)) → end_read |
        ((datardy t) ∧ (borw t) ∧ (outck t)) → dump_byte |
        ((datardy t) ∧ ¬(borw t) ∧ (outck t)) → dump_word |
        noop) in (
   (sr (t+1), counter (t+1), out (t+1), borw (t+1), datardy (t+1)) =
      nextstate (sr t, counter t, out t, borw t, datardy t, begin t)
                (byte t, rn t, outck t))


The operation of IO is more complicated than the operation of INT. Whenever the reset
line is raised, the state is reset as described in the specification of the reset operation.
When the datardy line goes high, the interpreter begins a read cycle. When the outck
line is raised and the datardy line is high, we dump either bytes or words depending on the
value of the borw line. There is a counter so that the correct number of bytes and words
are dumped. When the counter reaches 0 we end the read cycle (by pulling the datardy
line low). Otherwise, we do nothing.¹
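
The instruction selection described above is a priority chain: the first condition that holds determines the instruction. A minimal Python sketch follows, assuming boolean control lines and an integer counter value. The function name io_nextstate and the use of strings for instruction names are illustrative choices, not part of the specification.

```python
def io_nextstate(rn, datardy, begin, counter, borw, outck):
    """Select the IO instruction for the current time step, highest priority first."""
    if rn:
        return "reset"
    if datardy and begin:
        return "start_read"
    if datardy and counter == 0:
        return "end_read"
    if datardy and borw and outck:
        return "dump_byte"
    if datardy and not borw and outck:
        return "dump_word"
    return "noop"
```

Because the conditions are tested in order, reset overrides everything else, and no dump can occur unless datardy is high.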

As an example of the instructions in IO, consider the dump_word instruction.


⊢def dump_word (sr, counter, out, borw, datardy, begin)
               (byte, rn, outck) =
   let new_counter = (dec counter) in
   let i = (val counter) in
   let new_out = short (sr i) in
   (sr, new_counter, new_out, borw, datardy, begin)


The instruction updates the counter by decrementing the old value. The value on the
output is determined by the 16 most significant bits from the ith shift register, where i is the
value of the counter.
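
The effect of dump_word can be sketched in Python as follows. This is a sketch under stated assumptions: the shift registers are modeled as a list of 24-bit integers, short is taken to mean "keep the upper 16 bits" as the text describes, and the 7-bit counter is assumed to wrap modulo 128 when decremented.

```python
SR_WIDTH = 24      # shift register width in bits (from the text)
OUT_WIDTH = 16     # output register width in bits (from the text)
COUNTER_MOD = 1 << 7   # 7-bit counter; wraparound on decrement is an assumption


def dump_word(sr, counter, out, borw, datardy, begin):
    """Return the state after a dump_word instruction."""
    new_counter = (counter - 1) % COUNTER_MOD
    i = counter                                   # index uses the old counter value
    new_out = sr[i] >> (SR_WIDTH - OUT_WIDTH)     # 16 most significant bits
    return sr, new_counter, new_out, borw, datardy, begin
```

Note that the register is selected with the counter value before the decrement, matching the order of the let bindings in the specification.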

The most interesting feature of the specification of INT and IO is that they share state.
For example, both specify changes to sr, the variable representing the shift register. INT
produces a value that is placed in sr by its dump instruction. IO uses that value when
asked to present the results of the integration on the output lines.

Both interpreters also specify changes to datardy, the variable representing whether
or not data is ready to be output. INT sets datardy when it has dumped the contents of
the counter into the shift register. IO resets datardy when it is done outputting the
data.

Readers of this specification who are familiar with the design may be surprised to find
that some details in the circuit are not found in the specification. For instance, after the
integration period ends, there is an 8-cycle delay before data can be read from
the chip (i.e., datardy goes high). In the specification shown above, datardy goes high in the
time period after the int line is pulled high. This is an example of the temporal abstraction
going on between the circuit levels of the specification and the behavioral specifications
given here.

¹ Note that count in INT and counter in IO are two different state variables.


6 The  Top-Level  Specification 

The  final  specification  combines  the  specifications  of  the  two  interpreters  and  operates 
them  in  parallel. 


corr_top rep (acc, delay, sr, count, datardy,
              begin, counter, out, borw)
             (byte_e, rn, outck, a, b, int) =
   ((integrate_int rep (acc, delay, sr, count, datardy)
                       (a, b, int, rn)) ∧
    (io_int rep (sr, counter, out, borw, datardy, begin)
                (byte_e, rn, outck)))


The specification does not explicitly answer questions regarding the shared use of the
sr and datardy lines. For example, do INT and IO correctly coordinate the writing and
reading of sr? This and other important questions regarding the operation of
the correlator can be answered by analysis of the specification.


7 Conclusion. 

This paper has presented the behavioral specification for a VLSI correlator design. Before
this specification was written, the design was described in design documents and
papers such as [2]. These descriptions were necessarily ambiguous since they were written
in English. Deriving the specification by reading the design documents and talking to the
design engineer provides an interesting perspective on the design process. The behavioral
specification of the correlator documents the design and is useful for enhancing communication
between designers, customers, and users by unambiguously describing the function
of the device.

The  specification  presented  in  this  paper  is  a snapshot  of  the  design.  A specification  is 
constantly  subject  to  revision  to  bring  it  up  to  date  with  current  expectation  and  to  correct 
errors  that  are  part  of  any  written  description.  Future  work  will  extend  the  specification 
in  two  ways: 

• We intend to show that the specification meets certain requirements for correct operation.
For example, the analysis will make explicit the synchronization conditions
that must exist between the two interpreters for the chip to function correctly and
show that they are met.

• We will specify the structural level by deriving it from the design information captured
in the HDL description of the circuit. We intend to show that this structural
specification implies the architecture we have described above.
Abstract - The integration of modern CAD tools with formal verification environments
requires translation from hardware description language to verification
logic. A signal representation including both unknown state and a degree
of strength indeterminacy is essential for the correct modeling of many VLSI
circuit designs. A higher-order logic theory of indeterministic logic signals is
presented.

1 Introduction 

As higher transistor counts increase the complexity of VLSI circuits and the number of
potential test cases explodes, formal verification methods promise value in design fault
exclusion. Before verification is accepted by design engineers, stand-alone verification tools
that are used in the academic research arena must be integrated with the CAD tools being
used  by  VLSI  designers.  One  major  benefit  of  this  integration  is  that  VLSI  designers  will 
enjoy  increased  confidence  that  abstract  behavioral  models  are  correct.  There  are  several 
reasons  a VLSI  designer  may  choose  to  use  abstract  behavioral  models.  In  a top-down 
design,  a behavioral  description  may  be  used  to  simplify  circuit  understanding  before  the 
implementation  is  designed.  A behavioral  model  can  be  utilized  as  part  of  a simulation  of 
the  entire  system  at  an  early  date.  After  the  circuit  structure  is  designed  and  modeled, 
the  logic  simulation  of  complex  systems  can  become  very  slow.  The  simulation  can  be 
made  faster  by  replacing  circuit  blocks  with  the  corresponding  behavioral  model.  The 
problem  with  these  design  approaches  is  that  there  is  currently  no  way  to  relate  the  circuit 
structural  model  to  the  abstract  behavioral  model.  Having  a verification  tool  available  in 
the  VLSI  CAD  tool  suite  would  allow  these  models  to  be  related  through  mathematical 
analysis. 

The  hardware  description  languages  (HDL)  used  by  VLSI  CAD  tools  can  provide  the 
link  between  these  tools  and  the  verification  environment.  Engineers  can  design  using 
the  CAD  tool  HDL  and  this  description  can  be  automatically  translated  for  use  in  the 
verification  tool.  This  paper  examines  the  translation  of  logic  signal  representations  from 
the  BOLT  (Block  Oriented  Logic  Translator)  HDL,  used  in  the  NOVA  simulation  engine, 
to  the  HOL  theorem  proving  system. 




2 HOL 

HOL is a general theorem proving system developed at the University of Cambridge [4,6]
based on Church's theory of simple types, or higher-order logic. Higher-order logic is suitable
for specifying all aspects of hardware, including both structure and behavior [6,8].
In using higher-order logic, predicates are defined to represent both circuit primitives and
behavioral definitions [4]. First-order logic is well suited to represent simple combinational
circuits but not sequential circuits. In higher-order logic, variables are allowed to range
over functions and predicates, which makes it suitable for representing sequential circuit
behavior [8]. HOL is not an automated theorem prover but is more than simply a proof
checker, falling somewhere between these two extremes. Translation from BOLT descriptions
to HOL predicates requires that HOL primitives be defined to correspond to the
BOLT circuit representations.

Symbols in HOL are represented by strings of ASCII characters. Conjunction, disjunction,
negation, implication, and equality are represented by /\, \/, ~, ==>, and
= respectively. Universal quantification (for all) is symbolized ! and existential quantification
(there exists) is ?. The function composition operator is o and the conditional
expression "if a then b else c" is symbolized a => b | c.


3 Logic  States  and  Strengths 

Few modern VLSI circuits are designed using only classical logic gates [3,10]. In designs
using pass-transistor, tri-state, and pre-charge logic, it is common for circuit nodes to be
driven from multiple circuit elements. These multiple drivers are designed to have differing
drive strengths in order for one to dominate over another in cases of contention. The drive
strength can be considered to be closely related to current drive (charge sourcing) capability
[7,2]. The signal values represented in the NOVA simulation engine are an extension of
Bryant's lattice theoretic approach [7,11]. In the lattice theoretic approach the elements
in the domain of signal values represent the combination of logic state, from the set True,
False, and Unknown; and a signal strength. These signal values form a partially ordered
set with their order based on strength dominance when circuit output values are combined.

While Bryant later abandoned the lattice theoretic approach [2] stating "while this
approach at first seems very elegant, it cannot adequately describe the effects of transistors
in the X (Unknown) state," Cameron and Shovic have shown that the problem with the
Unknown state can be corrected by extending the domain of signal values to include some
degree of strength indeterminacy [3]. Thus, the signal values are extended to represent
both logic states and a range of signal strength.

The  Unknown  state  can  be  the  result  of  a node  connected  to  two  drivers,  one  driving 
to  a True  and  the  other  driving  to  a False,  neither  driver  having  sufficient  strength  to 
dominate  the  other;  or  simply  a node  whose  voltage  is  not  yet  known.  Combining  the  cases 
of  “invalid”  logic  level  and  “valid  but  not  known”  into  a single  Unknown  state  simplifies 
the  simulation  algorithm  but  may  make  the  simulator  pessimistic  since  it  will  propagate 
the Unknown state when resolving some circuit nodes [2].

We refer to the combination of state and strength information as STATES. The STATES
representation presented here is consistent with that presented in [3,10] except that the total
number of strengths, N, is extended to include a weakest strength, Nil, which represents a
node that is disconnected from all charge sources. By definition, a signal being driven by
the Nil strength must be at the Unknown state.

3.1  Representation  of  STATES 

Given the set of states True, False, and Unknown and a fully ordered set of strengths
$\sigma_1, \sigma_2, \ldots, \sigma_N$, we can define STATES. The STATES corresponding to the states True and
False are represented as a triple Kbd where:

K is 1 or 0, representing the logic state True or False;

bd represents an indeterminate range of strengths where:

b is the strongest possible strength ($\sigma_1 \le b \le \sigma_{N-1}$), which sets a lower bound on
the strength of a signal that can overdrive this state;

d is the weakest possible strength ($b \le d \le \sigma_{N-1}$), which sets an upper bound on the
strength of a signal that this state can overdrive.

The STATES corresponding to the Unknown state are represented as a triple Xpq where:

X represents the Unknown state;

p is the strongest possible strength driving toward 0 ($\sigma_1 \le p \le \sigma_{N-1}$), which sets a lower
bound on the strength of a signal that can overdrive this state to a 1;

q is the strongest possible strength driving toward 1 ($\sigma_1 \le q \le \sigma_{N-1}$), which sets a lower
bound on the strength of a signal that can overdrive this state to a 0.

3.2  The  Number  of  STATES 

For N strengths the number of True and False STATES is:

$$\mathrm{TF\_STATES}(N) = 2\,\bigl((N-1) + (N-2) + \cdots + 1\bigr) = (N-1)\,N \qquad (1)$$

For the Unknown state the number of STATES is:

$$\mathrm{X\_STATES}(N) = (N-1)^2 + 1 \qquad (2)$$

The plus one term in equation (2) represents the combination of Unknown state and
weakest strength, $\sigma_N$ = Nil. This STATE is referred to as Nil. Thus, the total number of
STATES for N strengths is equal to:

$$\mathrm{TOTAL\_STATES}(N) = 2N^2 - 3N + 2 \qquad (3)$$
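
The counts in equations (1) to (3) can be checked by enumerating the STATES directly. The Python sketch below assumes strengths are represented by their indices 1..N, with index N standing for Nil and a lower index meaning a stronger drive; the tuple encoding and the function name enumerate_states are illustrative.

```python
from itertools import product


def enumerate_states(n):
    """Return (true_false_states, unknown_states) for n strengths, index n being Nil."""
    driving = range(1, n)   # sigma_1 .. sigma_(N-1); Nil is excluded
    # Kbd triples: b is the strongest and d the weakest possible strength, so b <= d.
    true_false = [(k, b, d) for k in "10" for b in driving for d in driving if b <= d]
    # Xpq triples: p and q range independently over the driving strengths.
    unknown = [("X", p, q) for p, q in product(driving, repeat=2)]
    unknown.append("Nil")   # the "plus one" term of equation (2)
    return true_false, unknown


for n in range(1, 9):
    tf, xs = enumerate_states(n)
    assert len(tf) == (n - 1) * n                       # equation (1)
    assert len(xs) == (n - 1) ** 2 + 1                  # equation (2)
    assert len(tf) + len(xs) == 2 * n * n - 3 * n + 2   # equation (3)
```

For N = 4 the enumeration gives 12 True and False STATES and 10 Unknown STATES, 22 in total.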
Figure 1: Base Case Signal Lattice (N=2)

4 STATES  Theory 

A complicated  algorithm  for  determining  the  result  of  combining  STATES  is  presented 
in  [3].  This  algorithm  is  not  satisfactory  for  use  in  HOL.  We  have  developed  a lattice 
that  describes  the  result  of  joining  two  signals.  In  this  lattice  theoretic  approach  to  signal 
strengths,  the  join  (least  upper  bound)  operation  represents  the  resolution  of  contending 
circuit  elements  [11]. 

The lattice structure is described through the notion of immediate superiors or covers.
For two elements, a and b of a partially ordered set, a covers b if and only if a > b and
there exists no element x of the partially ordered set such that a > x > b [1]. A list of all
of the elements and covers completely describes a lattice. The covers can also be used to
define a graph of the lattice. The vertices of the graph are the elements and the segments
of the graph represent the covers. If the graph is drawn such that whenever x covers y, the
vertex x is higher than the vertex y, it is called a Hasse diagram of the lattice [1].
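
The definition of a cover translates directly into code. The sketch below computes the covers of a finite partially ordered set from its strict order relation and applies it to a four-element diamond of the kind used as the base case in the next section. The short element names X, 0, 1 and Nil are illustrative labels, not the paper's notation.

```python
def covers(elements, greater):
    """Return all pairs (a, b) such that a covers b under the strict order `greater`."""
    return {
        (a, b)
        for a in elements
        for b in elements
        if greater(a, b)
        and not any(greater(a, x) and greater(x, b) for x in elements)
    }


# A diamond: X on top, 0 and 1 in the middle, Nil at the bottom.
ELEMENTS = ["X", "0", "1", "Nil"]
STRICT_ORDER = {("X", "0"), ("X", "1"), ("X", "Nil"), ("0", "Nil"), ("1", "Nil")}

DIAMOND_COVERS = covers(ELEMENTS, lambda a, b: (a, b) in STRICT_ORDER)
```

X is greater than Nil but does not cover it, because 0 (and likewise 1) lies strictly between them, so the diamond has four covers.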


4.1  Defining  STATES  Lattice  Structure 

Given the base case N = 2 (N = 1 is a trivial case of one single STATE, Nil) there are four
STATES and no strength indeterminacy, meaning there is only a single value ($\sigma_1$) within
the range of possible strengths. There are four covers and the lattice Hasse diagram is as
presented in [7,11], a simple diamond (Figure 1).

To extend an N strength Hasse diagram (lattice) to N + 1 strengths:

1. Add three STATES and four covers to form a new diamond at the bottom of the N
strength diagram by replacing Nil with $X\sigma_N\sigma_N$, adding $0\sigma_N\sigma_N$ and $1\sigma_N\sigma_N$ each
covered by $X\sigma_N\sigma_N$, and placing Nil at the bottom of the diagram covered by both
$0\sigma_N\sigma_N$ and $1\sigma_N\sigma_N$.

2. For each M = N to 2, by -1, add the following STATES and covers:

(a) $X\sigma_{M-1}\sigma_N$ covered by $0\sigma_{M-1}\sigma_{N-1}$ and covering $X\sigma_M\sigma_N$

(b) $X\sigma_N\sigma_{M-1}$ covered by $1\sigma_{M-1}\sigma_{N-1}$ and covering $X\sigma_N\sigma_M$

(c) $0\sigma_{M-1}\sigma_N$ covered by $X\sigma_{M-1}\sigma_N$ and covering $0\sigma_M\sigma_N$

(d) $1\sigma_{M-1}\sigma_N$ covered by $X\sigma_N\sigma_{M-1}$ and covering $1\sigma_M\sigma_N$

4.2  The  Number  of  Covers 

The total number of covers for N strengths is equal to:

$$\mathrm{COVERS}(N) = 4N^2 - 10N + 8 \qquad (4)$$

4.3  The  Lattice  Structure  for  NOVA 

The NOVA simulation engine and BOLT HDL have been selected for this research so that
we may have access to commercial-scale designs written by nonacademic VLSI designers
while a translation tool to HOL is developed. In NOVA, N = 4 and $\sigma_1$ = a (active), $\sigma_2$ = r
(resistive), $\sigma_3$ = f (float) and $\sigma_4$ = Nil. Note that float > Nil and can be used to represent
signal levels at charged capacitive nodes. For N = 4, equation (3) yields 22 STATES and
equation (4) yields 32 covers. The Hasse diagram for the STATES and covers for NOVA
is shown in Figure 2. In addition to identifying the list of covers required to define the
lattice structure in the verification logic, the Hasse diagram also provides a quick, visual
understanding of the resolution of joined STATES.
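
As a cross-check on these counts, the extension procedure of Section 4.1 can be run mechanically. The Python sketch below is a reconstruction under assumptions, not the authors' implementation: parts of the procedure are illegible in the available text, so the "covered by" entries of steps (b) and (c) are filled in by symmetry with (a) and (d). Element names such as X12 denote X with p = sigma_1 and q = sigma_2. Ordering relations that become redundant after an extension are removed by a transitive reduction, which is an interpretation of how the Hasse diagram is kept minimal.

```python
def build_relations(n_strengths):
    """Ordering pairs (upper, lower) produced by extending the N = 2 diamond."""
    rel = {("X11", "011"), ("X11", "111"), ("011", "Nil"), ("111", "Nil")}
    for n in range(2, n_strengths):              # extend n strengths to n + 1
        x_nn = f"X{n}{n}"
        # Step 1: the old Nil becomes X(n,n); a new diamond hangs below it.
        rel = {(u, x_nn if lo == "Nil" else lo) for u, lo in rel}
        rel |= {(x_nn, f"0{n}{n}"), (x_nn, f"1{n}{n}"),
                (f"0{n}{n}", "Nil"), (f"1{n}{n}", "Nil")}
        # Step 2: for M = n down to 2, add four STATES with two relations each.
        for m in range(n, 1, -1):
            rel |= {
                (f"0{m-1}{n-1}", f"X{m-1}{n}"), (f"X{m-1}{n}", f"X{m}{n}"),  # (a)
                (f"1{m-1}{n-1}", f"X{n}{m-1}"), (f"X{n}{m-1}", f"X{n}{m}"),  # (b)
                (f"X{m-1}{n}", f"0{m-1}{n}"), (f"0{m-1}{n}", f"0{m}{n}"),    # (c)
                (f"X{n}{m-1}", f"1{m-1}{n}"), (f"1{m-1}{n}", f"1{m}{n}"),    # (d)
            }
    return rel


def hasse_covers(rel):
    """Drop every pair (u, lo) where lo is also reachable through another child of u."""
    below = {}
    for u, lo in rel:
        below.setdefault(u, set()).add(lo)

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in below.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    return {(u, lo) for u, lo in rel
            if not any(lo in reachable(c) for c in below[u] if c != lo)}


def lattice_counts(n_strengths):
    """Return (number of STATES, number of covers) for the constructed lattice."""
    rel = build_relations(n_strengths)
    elements = {e for pair in rel for e in pair}
    return len(elements), len(hasse_covers(rel))
```

With these assumptions, lattice_counts(4) agrees with the 22 STATES and 32 covers quoted for NOVA, which lends some support to the symmetric reading of steps (b) and (c).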

5 Implementing  STATES  in  HOL 

The  HOL  system  includes  a type  definition  package  that  allows  the  user  to  define  new 
types  and  prove  theorems  about  essential  properties  of  the  new  type.  The  type  package 
automatically  carries  out  much  of  the  necessary  formal  proof  required  for  a new  type 
definition.  Theorems  about  the  new  type  are  proven,  rather  than  simply  postulating 
axioms  for  the  new  type,  in  order  to  avoid  the  introduction  of  inconsistency  into  the  logic 
[9].  A new  type  for  signal  values,  called  strength,  is  defined  in  HOL  by  enumeration  of  all 
of  the  STATES.  Properties  proven  about  the  new  type  include  each  value  being  distinct,  an 
induction  theorem,  and  a case  analysis  (perfect  induction)  theorem.  The  STATES  lattice 
is defined by enumeration of the covers and the function join is defined to be the least
upper bound. Once the join function definition is complete, consistency of proofs that
utilize join is ensured by formal proof of the lattice theoretic obligations [11] for the join
operation. These obligations are:

1. Idempotence. For all STATES a, join a a = a.

2. Commutativity. For all STATES a and b, join a b = join b a.

3. Associativity. For all STATES a, b and c, join a (join b c) = join (join a b) c.

4. Existence of bottom. For all STATES a, join a Nil = a.

Figure 3: Memory Cell Schematic Diagram
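
The four obligations can be illustrated on the smallest non-trivial case. The Python sketch below defines join as the least upper bound computed from a list of covers, using the four-element diamond of the N = 2 base case, and checks each obligation by exhaustive enumeration. This is a finite check in ordinary code, not the HOL proof; the element labels and helper names are illustrative.

```python
from itertools import product

ELEMENTS = ["Nil", "0", "1", "X"]
COVERS = {("X", "0"), ("X", "1"), ("0", "Nil"), ("1", "Nil")}   # (upper, lower)


def upper_set(e):
    """All elements greater than or equal to e, found by walking the covers upward."""
    ups, frontier = {e}, [e]
    while frontier:
        cur = frontier.pop()
        for hi, lo in COVERS:
            if lo == cur and hi not in ups:
                ups.add(hi)
                frontier.append(hi)
    return ups


def join(a, b):
    """Least upper bound: the common upper bound that lies below all the others."""
    common = upper_set(a) & upper_set(b)
    least = [c for c in common if common <= upper_set(c)]
    return least[0]


def obligations_hold():
    """Check idempotence, commutativity, associativity, and existence of bottom."""
    idem = all(join(a, a) == a for a in ELEMENTS)
    comm = all(join(a, b) == join(b, a) for a, b in product(ELEMENTS, repeat=2))
    assoc = all(join(a, join(b, c)) == join(join(a, b), c)
                for a, b, c in product(ELEMENTS, repeat=3))
    bottom = all(join(a, "Nil") == a for a in ELEMENTS)
    return idem and comm and assoc and bottom
```

In this diamond, join("0", "1") is "X": two contending drivers of equal strength resolve to the Unknown state.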

5.1  STATES  Abstraction  Function 

Typically  a behavioral  specification  is  defined  in  terms  of  boolean  values.  An  abstraction 
function  is  required  to  relate  STATES,  used  in  structural  specifications,  to  boolean  values. 

STATES_ABS sig = ((sig=1aa)\/(sig=1ar)\/(sig=1rr)\/
                  (sig=1af)\/(sig=1rf)\/(sig=1ff)) => T |
                 ((sig=0aa)\/(sig=0ar)\/(sig=0rr)\/
                  (sig=0af)\/(sig=0rf)\/(sig=0ff)) => F |
                 ARB

The  Unknown  STATES  are  assigned  a value  ARB,  defined  to  be  an  arbitrarily  chosen  boolean 
value. 
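
The abstraction function has a direct analogue in ordinary code. In the Python sketch below, STATES are represented as strings such as "1aa" or "Xar" (an illustrative encoding), and the arb parameter stands in for HOL's ARB: an unspecified but fixed boolean about which nothing may be assumed.

```python
TRUE_STATES = {"1aa", "1ar", "1rr", "1af", "1rf", "1ff"}
FALSE_STATES = {"0aa", "0ar", "0rr", "0af", "0rf", "0ff"}


def states_abs(sig, arb=False):
    """Map a STATE to a boolean; Unknown STATES (including Nil) map to `arb`."""
    if sig in TRUE_STATES:
        return True
    if sig in FALSE_STATES:
        return False
    return arb
```

A result proved about the abstraction should hold whichever value arb takes, since the specification does not say which boolean ARB denotes.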

6 Theory  Demonstration 

A static  memory  circuit  cell,  implemented  with  gate  level  and  pass  transistor  primitives,  is 
used  to  demonstrate  the  STATES  theory  (Figure  3).  Without  a signal  value  representation 
that  realizes  output  dominance  this  circuit  cannot  be  correctly  modeled.  Fundamental  to 
the operation of this circuit is that the output strength of pass-transistor M1 dominates
the output of inverter Inv2 to force node n1 to the state of the input d while the gate g
is  True  (high  voltage).  The  feedback  inverter  Inv2  acts  to  store  the  state,  by  dominating 
the  pass-transistor  after  the  gate  goes  False,  turning  the  transistor  off. 

6.1  The  Circuit  Primitives 

The memory cell structure includes three predicate definitions: a pass-transistor element,
inverter elements, and the JOIN operation. Time is represented as a number (num) stream
and circuit signals are defined to be functions of type num to type strength.
The behavioral model of the cell is not defined for the gate input being at an unknown
state.  A simplified  pass-transistor  model  is  used  that  defines  that  the  signal  at  the  drain 
is  equal  to  the  signal  at  the  source  if  the  gate  is  True,  else  it  is  Nil. 

NTRAN (g,s,d) =
  ! t.
    d t = ((g t =1aa)\/(g t =1ar)\/
           (g t =1rr)\/(g t =1af)\/
           (g t =1rf)\/(g t =1ff)) => s t |
          Nil

The  inverter  predicate  has  five  arguments.  The  first  three  arguments  are  of  type 
strength  and  define  the  possible  inverter  output  STATES.  The  first  is  the  output  STATES 
for  a True  output,  the  second  for  a False  output,  and  the  third  the  Unknown  state  output. 
The  Unknown  output  value  is  derived  from  the  strongest  True  and  False  strengths.  The 
fourth  and  fifth  arguments  are  signal  functions  of  type  num  to  type  strength.  The  fourth 
is  the  inverter  input  and  the  fifth  is  the  output. 

INV 1s 0s Xs (in,out) =
  ! t.
    out t = (((in t =1aa)\/(in t =1ar)\/
              (in t =1rr)\/(in t =1af)\/
              (in t =1rf)\/(in t =1ff)) => 0s |
             ((in t =0aa)\/(in t =0ar)\/
              (in t =0rr)\/(in t =0af)\/
              (in t =0rf)\/(in t =0ff)) => 1s |
             Xs )
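
Both primitives are predicates that relate whole signals rather than functions that compute outputs. The Python sketch below checks the two predicates over a finite prefix of time, with signals modeled as lists of STATE strings indexed by time. The finite horizon, the string encoding, and the function names are assumptions made for illustration; the HOL definitions quantify over all times.

```python
TRUE_STATES = {"1aa", "1ar", "1rr", "1af", "1rf", "1ff"}
FALSE_STATES = {"0aa", "0ar", "0rr", "0af", "0rf", "0ff"}


def ntran_holds(g, s, d, horizon):
    """Drain equals source while the gate is True, otherwise Nil."""
    return all(
        d[t] == (s[t] if g[t] in TRUE_STATES else "Nil")
        for t in range(horizon)
    )


def inv_holds(one_s, zero_s, x_s, inp, out, horizon):
    """Output is zero_s for a True input, one_s for a False input, else x_s."""
    def expected(value):
        if value in TRUE_STATES:
            return zero_s
        if value in FALSE_STATES:
            return one_s
        return x_s

    return all(out[t] == expected(inp[t]) for t in range(horizon))
```

For example, inv_holds("1rr", "0rr", "Xrr", ...) corresponds to an inverter with resistive output strength, the role played by the feedback inverter in the memory cell.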


6.2  JOIN 


The  JOIN  predicate  performs  two  operations.  It  determines  the  resulting  signal  value  of 
combining  circuit  outputs  by  applying  the  join  function.  The  second  operation  is  related 
to  the  sequential  behavior  of  a charge  storage  node.  The  capacitance  of  a node  may  result 
in  a time  delay  when  the  node  is  driven  to  a new  signal  level.  The  delay  time  increases  as 
the  strength  of  the  driving  signal  decreases.  This  sequential  behavior  is  modeled  as  having 
a variable delay, whose length is based on the strength of the join function result [5,7].
The  Hasse  diagram  shows  the  relative  strength  of  STATES  and  can  be  used  to  abstract 
the  delay  values  for  individual  STATES  by  segregating  them  into  horizontal  bands  on  the 
diagram.  All  STATES  within  a common  band  have  the  same  delay  and  the  delay  is  longer 
for  lower  bands.  For  cases  where  it  is  desired  to  model  different  delays  for  rise  and  fall 
times  the  diagram  can  be  segregated  right  from  left  also. 

The demonstration cell is modeled as having two possible delays. When the pass-transistor
is turned on, the storage node at the join is driven by an active strength and
the delay is defined to be zero. When the pass-transistor is turned off, the storage node
is driven by the resistive strength of the feed-back inverter and the delay is defined to be
one.


JOIN (s',s'',s :num->strength) =
  ! t. let sig = join (s' t) (s'' t) in
       ((sig = 0aa) \/
        (sig = 1aa) \/
        (sig = Xaa) \/
        (sig = Xar) \/
        (sig = Xra)) => (s t = sig) |
                        (s (t+1) = sig)
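
The two-delay behavior of JOIN can be checked on finite traces. In the Python sketch below, signals are lists of STATE strings indexed by time, and the join function is passed in as a parameter. The helper nil_join is a deliberately partial stand-in that handles only the bottom and idempotent cases; it is an assumption for illustration and not the full 22-STATE lattice join.

```python
ZERO_DELAY = {"0aa", "1aa", "Xaa", "Xar", "Xra"}   # results involving active strength


def nil_join(a, b):
    """Partial join covering only 'join a Nil = a' and 'join a a = a'."""
    if a == "Nil":
        return b
    if b == "Nil" or a == b:
        return a
    raise ValueError("full lattice join is not modeled in this sketch")


def join_holds(s1, s2, s, join, horizon):
    """Check the JOIN predicate for t in range(horizon); s needs horizon + 1 entries."""
    for t in range(horizon):
        sig = join(s1[t], s2[t])
        if sig in ZERO_DELAY:
            ok = s[t] == sig          # zero delay: result appears immediately
        else:
            ok = s[t + 1] == sig      # unit delay: result appears one step later
        if not ok:
            return False
    return True
```

When the first input is Nil throughout, as with a pass-transistor that is turned off, a resistive value on the second input appears on the joined node one time step later.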


6.3  The  Structural  Description 

A BOLT  description  of  the  cell  is: 

MODULE q .CELL G D;
BEGIN
  N1 .NTRAN G D;
  q .INVR N1;
  N1 .INVR q (STR='RR');
END;

The  STR=’RR’  parameter  in  the  second  INVR  invocation  defines  the  output  strength  of  that 
inverter  as  resistive.  The  default  value  used  for  the  first  invocation  is  active.  The  HOL 
structural  specification  of  the  cell  is: 

cell_IMP (d,g,q) =
  ? n1 n1' n1'' :num->strength .
    NTRAN (g,d,n1') /\
    INV 1aa 0aa Xaa (n1,q) /\
    INV 1rr 0rr Xrr (q,n1'') /\
    JOIN (n1',n1'',n1)

6.4  The  Behavioral  Description 

When  the  gate  of  the  pass-transistor  is  True  the  cell  is  writing  the  input  and  the  output, 
q,  follows  as  the  inverse  of  d.  When  the  gate  is  False  the  cell  is  storing  the  previous  data. 
The  HOL  behavioral  description  is: 
cell_SPEC (d,g,q) =
  ! t.
    (g t) => (q t = ~d t) |
             (q (t+1) = q t)
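
The behavioral specification can be read as a check applied at every time step. The Python sketch below evaluates it over a finite prefix of time, with boolean signals modeled as lists. The finite horizon and the function name cell_spec_holds are assumptions for illustration, since the HOL definition quantifies over all times.

```python
def cell_spec_holds(d, g, q, horizon):
    """Check the cell specification for t in range(horizon); q needs horizon + 1 entries."""
    for t in range(horizon):
        if g[t]:
            # Writing: the output is the inverse of the input at the same time.
            if q[t] != (not d[t]):
                return False
        elif q[t + 1] != q[t]:
            # Storing: the output must be unchanged at the next time step.
            return False
    return True
```

For instance, with g = [True, False, False], a trace that writes at time 0 must keep q constant from time 1 through time 3, since the store condition applies at times 1 and 2.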

6.5  The  Cell  Verification 

Because  the  operation  of  the  cell  requires  that  the  output  of  the  pass-transistor  dominate 
the  resistive  strength  output  of  INV2  and  the  pass-transistor  is  not  an  amplifier,  there  is  a 
validity  condition  that  the  signal  applied  to  input  d must  be  stronger  than  resistive.  This 
condition  is  required  for  proper  circuit  operation  and  is  not  simply  a verification  artifact. 

Valid1 (d) =
  ! t.
    (d t = 1aa) \/ (d t = 0aa)

Because  the  behavior  of  the  cell  is  defined  only  for  boolean  value  signals  at  the  gate, 
there  is  a validity  condition  for  the  gate  that  it  be  either  a True  or  False  state.  This 
condition  yields  a 12  way  case  analysis  in  the  proof,  but  is  easily  reduced  to  needing  to 
consider  only  the  two  cases  of  writing  and  storing. 

Valid2 (g) =
  ! t.
    (g t = 1aa) \/ (g t = 1ar) \/ (g t = 1rr) \/
    (g t = 1af) \/ (g t = 1rf) \/ (g t = 1ff) \/
    (g t = 0aa) \/ (g t = 0ar) \/ (g t = 0rr) \/
    (g t = 0af) \/ (g t = 0rf) \/ (g t = 0ff)

The  verification  of  the  cell  entails  proving  that  the  cell  structural  description  and 
validity  conditions  logically  imply  the  behavioral  specification.  The  theorem  proven  is: 

|- (Valid1 (d) /\ Valid2 (g) /\ cell_IMP (d,g,q)) ==>
     cell_SPEC (STATES_ABS o d, STATES_ABS o g, STATES_ABS o q)

7 Future  Work 

The theory of signal lattices presented in this paper is an important first step in linking
BOLT and HOL. Future steps include:

1. Developing and validating a set of HOL theories corresponding to the primitive components
in the NOVA library.

2.  Writing  a formal  semantics  for  BOLT. 

3.  Embedding  BOLT’s  formal  semantics  in  HOL. 

These steps do not include work on translating NOVA behavioral models to HOL, a difficult
but necessary task.
8 Conclusion

The first step in the integration of CAD VLSI design tools with a verification tool is the
translation of the HDL representations into the verification logic. A verification logic theory
has been presented for reasoning about an indeterministic signal value representation
based on a lattice approach. This work is necessary because the previous algorithm for
joining indeterministic signal values is not suitable for a verification logic environment.
The suitability of the lattice approach is demonstrated through the verification of a static
memory cell. The lattice diagram presented also gives users a quick view of the result of
combining indeterminate signals of different values.
State Machines
for VLSI System Design
Moscow, Idaho 83843

Abstract - A formal specification of VLSI state machines based on a sequence
invariant architecture is presented. The behavioral description represents a
logical description of any synchronous state machine. The structural specification
represents an adoptive architecture developed using VLSI technology to
implement the state machine. This specification becomes a tool for future verification
and specification of state machines using dedicated machines and/or
alternative technologies. The verification of the state machine is done in HOL,
a theorem  proving  system.  Using  HOL,  the  verification  shows  analytically  that 
the  circuit  structure  has  the  desired  behavior. 

1 Introduction 

With the advancement of integrated circuit technology, the need for new methods of ensuring
design correctness is becoming more prominent. Simulation remains the dominant
method  in  use,  but,  recently,  interest  has  grown  in  using  formal  logical  analysis  to  show 
the  correctness  of  digital  systems. 

Formal verification of hardware involves using theorem-proving techniques to verify
that a stated behavioral definition of a circuit is a logical consequence of the structural
description of the circuit, i.e., proving that the structure of the circuit forces it to behave
as stated. This paper presents a formal specification and verification of a general state
machine. The specification describes the behavior and structure of the state machine. The
behavioral specification is a logical representation of a state machine. Using a particular
design in VLSI technology, a structural description based on the Sequence Invariant
Architecture is described. The structure clearly specifies how components are connected and
built to achieve the operation of the state machine. The verification shows, by analysis,
that the structural specification implies the behavioral specification, using a theorem
proving system known as HOL [1]. Hence, the VLSI architecture is capable of implementing
any state machine.


2 The  HOL  System 

As described by Birtwistle and Subrahmanyam [3], the HOL system (‘HOL’ standing for
‘higher order logic’) is designed to facilitate the interactive generation of formal proofs. A
logic in which problems can be expressed is interfaced to a programming language in which
proof procedures and strategies can be encoded. The combination enables deduction in
logic (in the sense of chains of primitive inference steps) to be produced by invocation of
programming constructs at a higher level of abstractness.

The logic part of HOL is conventional higher-order logic. New types, constants and
axioms can be introduced by the user, and organized in logic theories. The programming
language of HOL is ML (for ‘meta-language’). The type discipline of ML ensures that the
only way to create theorems in the object logic is by performing proofs; theorems have
the ML type thm, objects of which can only be constructed by the application of inference
rules to other theorems or axioms.


3 Sequential  Circuits  Overview 

Sequential  circuits  are  categorized  as  either  synchronous  or  asynchronous,  depending  upon 
whether or not the behavior of the circuit is clocked at discrete instants of time. The
operation  of  synchronous  sequential  circuits  (the  topic  of  this  paper)  is  controlled  by  a 
synchronizing  pulse  signal  called  a clock  pulse  or  simply  a clock. 

Sequential machines are usually represented by state diagrams or state tables (flow
tables). A flow table has a row corresponding to every internal state of the machine and
a column corresponding to every possible input. The entry in row $S_i$ and column $I_m$
represents the next state produced if $I_m$ is applied when the machine is in state $S_i$.
Table 1 shows a flow table for an arbitrary circuit with six states and three inputs. Once
the flow table is constructed for a given circuit, a state assignment is made. A state
assignment is the encoding of the states of the flow table with the internal state variables
$(y_1, y_2, y_3)$. Table 2 shows the state assignment and the next state entries
for Table 1. Finally, the next state equations are derived from the state assignment using
Karnaugh map techniques. We can also derive an equation that describes the output
behavior from the flow table.

3.1  SISM  Overview 

An adaptive hardware architecture has been developed [2] that enables the designer to
design any sequential circuit based on the width of the machine, w, and the number of
control inputs, I, without knowledge of the sequence to be incorporated. This adaptive
architecture is called a Sequence Invariant State Machine (SISM) design.

With the SISM realization, any flow table can be implemented without a change in the
hardware configuration. That is, given w and I, a hardware circuit is easily derived that
can implement any state machine that has a maximum of I control inputs and $2^w$ internal
states.


3.2  Architecture  And  Operation 
        I1      I2      I3
A       C, 1    B, 1    A, 0
B       D, 0    C, 1    B, 0
C       E, 0    D, 0    C, 0
D       F, 1    E, 1    D, 1
E       A, 0    F, 0    E, 1
F       B, 0    A, 1    F, 1

Table 1: General 6-state, 3-input flow table.
1 0 1   F   0 0 1, 0    0 0 0, 1    1 0 1, 1
1 1 0   G   0 0 0, 0    0 0 0, 0    0 0 0, 0
1 1 1   H   0 0 0, 0    0 0 0, 0    0 0 0, 0

Table 2: State Assignment for Table 1.


Figure 1 shows a general SISM architecture. This architecture can be used to implement
one of the next state variables in Table 2.


Figure  1:  General  SISM  Architecture. 

The architecture contains the following components:

• The destination state codes are derived from the next state entries in the state
assignment table by inspection. For example, the destination state codes for state B
and state variable $y_i$ are the next state bits $Y_i$ associated with state B. Therefore, the
destination state codes for state B are (000, 110, 101) under control inputs $(I_1, I_2, I_3)$
and variables $(y_1, y_2, y_3)$ respectively. One way to implement those codes is to use
constants, that is, presenting ones and zeros at the input of the structure. Alternatively,
they could be programmed into the structure using various memory devices [3].

• The input switch matrix is combinational logic that produces all the possible next
state entries for each current control input.

• The next state logic consists of an independent path for each of the present states in
the state assignment flow table.

• The storage element is a D-FF that preserves the present state.

The  operation  of  the  architecture  is  as  follows.  The  current  control  input  selects  the  set 
of  potential  next  states  that  the  circuit  can  assume  (input  column  in  the  flow  table).  The 
present  state  variables  select  the  exact  next  state  (row  in  the  flow  table)  that  the  circuit 
will  assume  at  the  next  clock  pulse. 
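
The two-stage selection described above can be sketched in a few lines of Python. This is only an illustration at the symbolic level of Table 1, not the SISM hardware: the names `FLOW_TABLE`, `input_switch`, `next_state_logic` and `run` are hypothetical, and control inputs I1, I2, I3 are assumed to map to column indices 0, 1, 2.

```python
# Flow table from Table 1: each present state maps to a list of
# (next_state, output) entries, one per control input I1, I2, I3
# (stored at column indices 0, 1, 2; this indexing is an assumption).
FLOW_TABLE = {
    "A": [("C", 1), ("B", 1), ("A", 0)],
    "B": [("D", 0), ("C", 1), ("B", 0)],
    "C": [("E", 0), ("D", 0), ("C", 0)],
    "D": [("F", 1), ("E", 1), ("D", 1)],
    "E": [("A", 0), ("F", 0), ("E", 1)],
    "F": [("B", 0), ("A", 1), ("F", 1)],
}


def input_switch(column):
    """The control input selects one column: a candidate entry per present state."""
    return {state: row[column] for state, row in FLOW_TABLE.items()}


def next_state_logic(candidates, present_state):
    """The present state selects one row out of the candidate column."""
    return candidates[present_state]


def run(start, inputs):
    """Apply a sequence of control inputs, one per clock pulse."""
    state, trace = start, []
    for column in inputs:
        state, output = next_state_logic(input_switch(column), state)
        trace.append((state, output))
    return trace
```

For example, `run("A", [0, 0, 0])` applies I1 three times and walks A to C to E and back to A. Changing the sequence only changes the contents of `FLOW_TABLE`, not the two selection functions, which is the property the SISM exploits.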

4 Formal  Specification 

The previous section presented a description of the SISM architecture and operation. This
section presents the formal specification of the SISM architecture. The behavioral
specification is introduced first and then a structural implementation is described.


[Figure 2 diagram: block labeled SM DEVICE with signals W, G, CS(T) and CS(T+1).]

Figure 2: General state machine device


4.1  The  Behavioral  Specification 

A general behavioral description of all state machines can be specified by defining a
predicate that relates the inputs and outputs and defines the state transition. Figure 2 shows
a general state machine device. The behavior of the state machine device can be specified
by a predicate sism_spec that is true only when the combination of the values of the
variables w, g, data, clr, ld and the state variable cs is one that could occur on the
corresponding input and output signals of the device. The variables are references to actual
signals and data as explained below.
• ‘w’, “(:num)”.

This represents the width of the state machine, i.e., the number of next state
variables.

• ‘g’, “(: time → num)”.

This is the control input to the state machine. It is represented as a function of
time. That is, at time (t), the input (g) is the control input, which is a number
from zero to I, where I is the maximum number of control inputs.

• ‘data’, “(: num → num → num → bool)”.

These are the destination state codes for the entire state machine. They are represented
as a function associated with the width of the state machine and the list of data for each
of the next state variables.

• ‘clr’, “(: time → bool)”.

This signal, when enabled, forces the output values to be cleared to low.

• ‘ld’, “(: time → bool)”.

This signal, when enabled, loads the input data into the D-ff and presents it to the
output.

• ‘cs’, “(: time → num → bool)”.

This is the current state value. It is represented as a function of time. That
is, at time (t) this value will enable one path from the input to the output.

The  overall  behavior  of  the  state  machine  is  given  by  the  following  logic  term: 


    sism_spec =
      "sism_spec w g data clr ld (cs :num->num->bool) =
         (∀ t:num.
            cs (t+1) = (clr t → ZEROS w |
                        ld t → data (g t) (val w (cs t)) |
                        cs t))"

The predicate sism_spec asserts that the relationship between those values corresponds
to the way the state machine works in practice. That is, the next state of the machine at
time (t+1) is a function of the value of the data input and the current state at time (t).
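
The state transition asserted by the behavioral predicate can be mirrored by a small executable model. The following is a sketch under stated assumptions, not a translation of the HOL definition: state vectors are tuples of booleans with index 0 taken as the least significant bit, `val`, `zeros` and `to_bits` are stand-ins for the HOL functions of similar name, and `add_input` is a purely hypothetical destination-code function used for the example.

```python
from typing import Callable, List, Tuple

Bits = Tuple[bool, ...]  # bit i of a state vector; index 0 = LSB (assumption)


def to_bits(w: int, value: int) -> Bits:
    """Encode an integer as a w-bit vector."""
    return tuple(bool((value >> i) & 1) for i in range(w))


def val(w: int, bits: Bits) -> int:
    """Numeric value of a w-bit state vector."""
    return sum(1 << i for i in range(w) if bits[i])


def zeros(w: int) -> Bits:
    return (False,) * w


def sism_spec_step(w: int, g_t: int, data: Callable[[int, int], Bits],
                   clr_t: bool, ld_t: bool, cs_t: Bits) -> Bits:
    """cs(t+1): clear has priority, then load, otherwise hold."""
    if clr_t:
        return zeros(w)
    if ld_t:
        return data(g_t, val(w, cs_t))
    return cs_t


def simulate(w: int, g: List[int], data, clr: List[bool], ld: List[bool]) -> List[Bits]:
    """Run the model from the all-zero state for len(g) clock pulses."""
    cs = [zeros(w)]
    for t in range(len(g)):
        cs.append(sism_spec_step(w, g[t], data, clr[t], ld[t], cs[t]))
    return cs


def add_input(w: int):
    """Hypothetical destination codes: next state = (state + input) mod 2**w."""
    return lambda g_t, state: to_bits(w, (state + g_t) % (2 ** w))
```

With `w = 3`, inputs `[1, 2, 1, 1]`, load pattern `[True, True, False, True]` and a clear on the last pulse, the model visits states 0, 1, 3, 3, 0: two loads, one hold, then a clear that overrides the load.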

4.2  The  Structural  Specification 

An implementation of state machines based on the sequence invariant architecture is
presented. Using tools available in HOL, the structure of the SISM can be described by
specifying high level descriptions of the major pieces of the SISM device and combining
them so that they correspond to the actual structure. The structure of the SISM can be
represented by a predicate sism_imp with a definition as follows:

The predicate sism_imp_rec defines the structure of the circuit. The predicate is
defined recursively on its width, indicating the iterative structure of the circuit. The predicate
is defined as follows:


    sism_imp_rec =
      "(sism_imp_rec 0 w g data clr ld cs = block 0 w g data clr ld cs)
       ∧
       (sism_imp_rec (n+1) w g data clr ld cs =
          (sism_imp_rec n w g data clr ld cs) ∧
          (block (n+1) w g data clr ld cs))"


The predicate block gives the structure of a single slice of the circuit. Block is defined
by conjoining the predicates that specify the behaviors of each component with the logical
connective (∧) and using existential quantification (∃) to hide the internal signals. The
following logic term describes block:

    block =
      "block id w g data clr ld cs =
         (∃ out1 out2.
            (sel id w g data out1) ∧
            (mux w out1 cs out2) ∧
            (d_ff out2 ld clr (cs id)))"

In this definition the two internal lines (out1, out2) are hidden from the external
environment using the existential quantifier (∃). The definition of block states that the
values which can appear on the external inputs and outputs of the SISM device are precisely
those which satisfy the constraints imposed by the predicates modeling the three modules
from which it is built. The modules that are used to define the predicate block are explained
next.


The Selector Module  The selector module is defined using predicates as a function.
The predicate that defines the behavioral specification is a function, as shown below.

Figure 3: A block representing the SISM device

The selector is a device that is controlled by the control inputs. For each block there
are $2^w$ selectors. Hence, $2^w$ outputs are presented to the next device. The data inputs
to the selector are the destination state codes. The outputs are all the data selected by
the current control input. Referring to the definition and to Figure 3, the selector has
three external inputs and one internal output. Some of the variables were described earlier;
the new variables are described as follows.

• ‘id’, “(:num)”.

This represents the current block of the state machine, i.e., if w=3 and id=1 then
the current next state variable is the first variable in the SISM block.

• ‘out’, “(: num → time → bool)”.

This function represents all possible outputs for each next state variable under the
current control input.


The MUX Module  The MUX module is a function that takes $2^w$ inputs and presents
one value to the output based on the current state. The following predicate describes the
behavior of MUX:


Referring to the definition and to Figure 3, the MUX module has two external inputs, one
internal input, and one internal output. The internal inputs and outputs are described as
follows.
• ‘input’, “(: num → time → bool)”.

This is the data provided by the previous module. It is a bit vector of length $2^w$,
which represents all possible next state entries.

• ‘output’, “(: time → bool)”.

This is the value selected by the current state as one of the next state variables at the
next clock pulse.

The D-ff Module  The D-ff module is a memory device that presents the input to the
output at the next clock pulse. The predicate that describes the behavioral specification is
as follows:

    d_ff =
      ⊢def d_ff in ld clr q =
             (∀ t:time. q (t+1) = ((clr t) → F |
                                   (ld t) → in t | q t))
             ∧ (q 0 = F)

Referring to the definition and to Figure 3, the following variables are defined.

• ‘in’, “(: time → bool)”.

This is the next state variable provided by the previous module, to be presented to
the output at the next clock pulse.

• ‘q’, “(: time → bool)”.

This is the output value, which constitutes one of the variables that, when combined
with the outputs from the other blocks, result in the current state.
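
The D-ff behavior (clear has priority over load, load over hold, and the output starts low) can be exercised with a short simulation. This is a sketch only: signals are modeled as Python lists indexed by time rather than as functions of time, and the name `d_ff_trace` is hypothetical.

```python
from typing import List


def d_ff_trace(inp: List[bool], ld: List[bool], clr: List[bool]) -> List[bool]:
    """Return q(0), q(1), ..., q(n) for n time steps of input signals.

    q(0) is False; at each step clear wins over load, and load wins over hold.
    """
    q = [False]  # initial condition: q 0 = F
    for t in range(len(inp)):
        if clr[t]:
            q.append(False)      # cleared to low
        elif ld[t]:
            q.append(inp[t])     # load the input
        else:
            q.append(q[t])       # hold the previous value
    return q
```

For instance, with `inp = [True, False, False, True]`, `ld = [True, False, True, True]` and `clr = [False, False, False, True]`, the output sequence is low, high, high, low, low: a load, a hold, a load of a zero, and finally a clear that overrides a simultaneous load.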


5 Verification 

The goal of the verification is stated in logic as follows:

    "∀ w g data clr ld cs.
        sism_imp w g data clr ld cs ⟹
        sism_spec w g (DATA_ABS w data) clr ld (ABS w cs)"


The goal states that the structural implementation implies the behavioral description of
the circuit, or, that the behavior follows from the structure. In the goal, DATA_ABS and
ABS are two functions used to abstract the signals w, data and cs, which are defined at the
structural level, to behavioral level signals.

The verification is approximately 60% done. The proof is carried out using induction
on the width of the SISM. HOL provides mechanical support for induction, rewriting, case
analysis and other necessary proof techniques.
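
As a rough sanity check of what the goal claims (and in no way a substitute for the HOL proof), a bit-sliced model of the structure can be compared against the behavioral transition by bounded simulation. The selector and MUX definitions are not reproduced in the text above, so `sel` and `mux` below are assumptions based on the prose; `data_abs` is likewise a hypothetical stand-in for DATA_ABS, and the per-bit table layout `data[bit][input][state]` is chosen only for this sketch.

```python
import itertools
import random


def val(bits):
    """Numeric value of a state vector, index 0 = LSB (assumption)."""
    return sum(1 << i for i, b in enumerate(bits) if b)


def sel(bit_id, g, data):
    """Assumed selector: the 2**w candidate bits for one slice under input g."""
    return data[bit_id][g]


def mux(candidates, cs):
    """Assumed MUX: the present state picks one candidate bit."""
    return candidates[val(cs)]


def d_ff(d, ld, clr, q):
    return False if clr else (d if ld else q)


def block(bit_id, g, data, clr, ld, cs):
    out1 = sel(bit_id, g, data)
    out2 = mux(out1, cs)
    return d_ff(out2, ld, clr, cs[bit_id])


def sism_imp_step(w, g, data, clr, ld, cs):
    """One clock pulse of the structure: w independent slices."""
    return tuple(block(i, g, data, clr, ld, cs) for i in range(w))


def data_abs(w, data):
    """Regroup per-slice tables into (input, state) -> next-state vector."""
    return lambda g, s: tuple(data[i][g][s] for i in range(w))


def sism_spec_step(w, g, data_fn, clr, ld, cs):
    if clr:
        return (False,) * w
    return data_fn(g, val(cs)) if ld else cs


def check(w, n_inputs, trials=20, seed=0):
    """Compare structure and behavior on random destination codes."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [[[rng.random() < 0.5 for _ in range(2 ** w)]
                 for _ in range(n_inputs)] for _ in range(w)]
        spec_data = data_abs(w, data)
        for cs in itertools.product([False, True], repeat=w):
            for g in range(n_inputs):
                for clr, ld in itertools.product([False, True], repeat=2):
                    if (sism_imp_step(w, g, data, clr, ld, cs)
                            != sism_spec_step(w, g, spec_data, clr, ld, cs)):
                        return False
    return True
```

A call such as `check(3, 3)` covers every state, input and control combination for a handful of random machines of width 3. The inductive proof in HOL is what extends the result to every width and every choice of destination codes.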
6 Conclusion

This  paper  presents  the  design  for  a SISM  that  is  being  proven  to  work  correctly.  This 
is  especially  significant  because  the  design  of  the  SISM  is  very  general.  Future  work  will 
entail  tying  the  structural  specification  to  the  actual  circuit  and  using  this  work  to  verify 
specific  state  machines  based  on  the  SISM  design. 
Ultra Low Power CMOS Technology

Abstract - This paper discusses the motivation, opportunities, and problems
associated with implementing digital logic at very low voltages, including
the challenge of making use of the available real estate in 3D multichip
modules, energy requirements of very large neural networks, energy optimization
metrics and their impact on system design, modeling problems, circuit design
constraints, possible fabrication process modifications to improve performance,
and barriers to practical implementation.

1 Introduction 

As technology continues to scale into the submicron regime, massively parallel
architectures are increasingly being constrained by power considerations. Minimizing the energy
per operation throughout the system is assuming increasing importance. We are
investigating “Ultra Low Power CMOS” to reduce the energy per operation in massively parallel
signal processors, microsatellites, and large scale neural networks. We are investigating
operating with supply and threshold voltages of a few hundred millivolts to reduce energy
per operation by more than 100 times.

In this paper, we show that minimum energy per operation is achieved in the
subthreshold regime, and that the optimum performance is obtained when $V_{dd} = V_t$ and
$Gnd = V_t - V_{dd}$. We also show that minimum energy x time occurs when $V_{dd} = 3V_t$. We
show that $V_t$ should be chosen such that $I_{on}/I_{off} = l_d/a$, where $l_d$ is the logic depth and $a$
is the activity ratio, the fraction of gates which are switching at any given time. We also
show that $l_d = 11$ minimizes energy in a 32x32 bit parallel multiplier.

2 Motivation 

The application domains we are targeting include wideband spectrometers requiring $10^{12}$
operations per second, microsatellites with 100mW power budgets, large scale neural
networks requiring $10^{15}$ connections per second and 1fJ per connection, and small, massively
parallel digital signal coprocessors.

As an example, a single SBus slot in a Sun SPARCstation occupies about 200 cm³, can
accommodate over 2000 cm² of active silicon using 3D stacked multichip module technology,
and has a power budget of 10W (see Fig 1). An architecture with a power density of
2W/cm² and 40 MIPS per chip, typical of modern microprocessors, would dissipate 4kW
if tiled over the available area and achieve 80 billion operations per second. Only 5 cm² of
silicon can be used at 10W, yielding 200 MIPS. If the supply voltage is lowered to 700mV,
each chip would dissipate 5mW, and the entire 2000 cm² could be used to achieve 10 billion
operations per second at 10W.

Figure 1: 3D MCM in an SBus slot: 2000 cm², 10W max. Vdd = 0.7V permits 10 GIPS.
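
The arithmetic behind this example can be reproduced with a few lines of Python. The sketch makes two assumptions that the text implies but does not state: each chip occupies 1 cm² (80 billion operations per second at 40 MIPS per chip over 2000 cm²), and each low voltage chip delivers 5 MIPS (10 billion operations per second spread over 2000 chips). The function name `tile` is hypothetical, and power is kept in integer milliwatts to avoid rounding artifacts.

```python
AREA_CM2 = 2000        # active silicon available in the slot
BUDGET_MW = 10_000     # 10 W power budget, in milliwatts
CHIP_AREA_CM2 = 1      # assumed chip size (see lead-in)


def tile(power_per_chip_mw, mips_per_chip, budget_mw=BUDGET_MW):
    """Number of chips usable under the area and power limits, with totals.

    Pass budget_mw=None to ignore the power budget and tile the whole area.
    """
    chips = AREA_CM2 // CHIP_AREA_CM2
    if budget_mw is not None:
        chips = min(chips, budget_mw // power_per_chip_mw)
    return {
        "chips": chips,
        "power_mw": chips * power_per_chip_mw,
        "mips": chips * mips_per_chip,
    }
```

`tile(2000, 40, budget_mw=None)` gives the 4 kW, 80,000 MIPS figure for an unconstrained tiling, `tile(2000, 40)` gives 5 chips and 200 MIPS at 10 W, and `tile(5, 5)` gives 2000 chips and 10,000 MIPS at 10 W.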


Low voltage digital logic is not new. Richard Swanson described a 100mV CMOS ring
oscillator  in  [6].  Eric  Vittoz  discussed  subthreshold  design  techniques  used  in  the  digital 
watch  industry  in  [4].  Carver  Mead  described  a variety  of  subthreshold  analog  circuits 
for  neural  networks  in  [1].  We  believe  that  low  voltage  circuits  can  be  used  effectively  for 
massively  parallel  computation  in  power  constrained  environments,  and  that  lowering  the 
voltage  in  submicron  technologies  has  the  added  benefit  of  maintaining  manageable  signal 
frequencies  at  the  system  level. 


4 Transistor  Current 

The following equations [6,7] describe drain current as a function of gate voltage, as shown
in Fig 2.

Figure 2: Transistor current vs voltage. Current is exponential with voltage below $V_t$ and
quadratic above $V_t$.

Figure 3: Model discontinuity at $V_{gs} = V_t$. The subthreshold model says $I_{ds} = knV_T^2$. The
saturation model says $I_{ds} = \frac{k}{2}(V_{gs} - V_t)^2 = 0$. In the figure $V_t = 200$mV.
subthreshold: $V_{gs} < V_t$; $I_0 = knV_T^2$

$$I_{ds} = I_0\, e^{(V_{gs} - V_t)/(nV_T)} \left(1 - e^{-V_{ds}/V_T}\right)$$

saturation: $V_t < V_{gs} < V_{ds} + V_t$

$$I_{ds} = \frac{k}{2}(V_{gs} - V_t)^2$$

linear: $V_{ds} + V_t < V_{gs}$

$$I_{ds} = \frac{k}{2}\left(2(V_{gs} - V_t)V_{ds} - V_{ds}^2\right)$$

where $V_{gs}$ is the gate-source voltage, $V_t$ is the threshold voltage, $I_{ds}$ is the drain current, $k$
is the transconductance in A/V², $n$ is the gate coupling coefficient, usually around 0.7, $V_T$
is the thermal voltage, 0.026V, and $I_0$ is the current at $V_{gs} = V_t$.

Note the exponential dependence of current on voltage below $V_t$, and the quadratic
dependence above $V_t$. These equations do a poor job of modeling behavior in the
neighborhood of $V_t$ (see Fig 3).
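
The three-region model and its discontinuity at threshold can be illustrated with a small piecewise function. This is a sketch under assumptions: the subthreshold prefactor is taken as $I_0 = knV_T^2$, the exponent as $(V_{gs} - V_t)/(nV_T)$, and the default values of `k` and `n` are illustrative placeholders rather than figures from the paper.

```python
import math

V_THERMAL = 0.026  # thermal voltage V_T in volts


def drain_current(v_gs, v_ds, v_t=0.2, k=1e-4, n=1.0):
    """Piecewise drain current in amperes; k in A/V^2 (k and n are illustrative)."""
    if v_gs < v_t:
        # Subthreshold: exponential in gate voltage.
        i_0 = k * n * V_THERMAL ** 2
        return (i_0 * math.exp((v_gs - v_t) / (n * V_THERMAL))
                * (1.0 - math.exp(-v_ds / V_THERMAL)))
    if v_gs < v_ds + v_t:
        # Saturation: quadratic in gate overdrive.
        return 0.5 * k * (v_gs - v_t) ** 2
    # Linear region.
    return 0.5 * k * (2.0 * (v_gs - v_t) * v_ds - v_ds ** 2)
```

Evaluating the function just below and exactly at `v_gs = v_t` shows the jump: the subthreshold branch approaches a nonzero current while the saturation branch returns zero. By contrast, the saturation and linear branches agree at `v_gs = v_ds + v_t`.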

relative  performance  vs  supply  and  threshold  voltage 


Performance can be approximated when the supply voltage is over threshold by

$$f = I/Q = k(V - V_t)^2/(CV)$$

where $f$ is the clock frequency, $k$ is transconductance, and $C$ is the capacitance being
switched.


5 Optimum  Logic  Depth 
We found the optimum logic depth in a 32 x 32 bit tree multiplier by reducing the supply
voltage to keep the throughput constant (see Fig 5). We also found the area penalty using
this approach (see Fig 6). $l_d = 11$ is close to the propagation delay through a 4:2 adder
[2].

Figure 5: Optimum logic depth of a 32x32 bit tree multiplier. For a given $l_d$, the supply
voltage is lowered to match the unpiped throughput. Minimum power consumption occurs
at $l_d = 11$. Latch energy increases as $l_d$ decreases, eventually exceeding logic energy, which
decreases as $l_d$ decreases.

Figure 6: Relative area vs logic depth in a 32x32 bit multiplier. The area penalty at
$l_d = 11$ is 37%.

6 Minimum Energy

The current available to switch a node is the difference between the current of the ON
device and the leakage current of the OFF device. In standard CMOS, $V_t$ is so high that
$I_{off}$ can be ignored, but in low voltage applications it can be an appreciable fraction of
$I_{on}$.

, _ Q c.v  c,v 

* " i i u-i.fi 

*oJt 

E.,  = \aC,V 1 

E = E"  + Edc  = tg.V1  (a  + x) 

toff 

E is minimum when $I_{on}/I_{off}$ is maximum. Referring to Fig 2, $I_{on}/I_{off}$ is maximum and
constant in the subthreshold region.

In the subthreshold region, if $V_{dd} = V = V_{hi} - V_{lo}$, then $I_{on}/I_{off} = e^{(V_{hi} - V_{lo})/(nV_T)} =
e^{V/(nV_T)}$, so E depends only on $V = V_{hi} - V_{lo}$. Therefore, for a given $V_{dd}$, energy is constant
in the subthreshold region. For maximum performance at minimum energy, set $V_{hi} = V_t$
and $V_{lo} = V_t - V_{dd}$.

DC energy rises exponentially as $V_{dd}$ decreases. AC energy rises quadratically as $V_{dd}$
increases. For optimum $V_t$,

$$P_{ac} = aCV^2 f$$
$$P_{dc} = I_{off} V$$
$$I_{on} = l_d C V f$$

If $P_{ac} = P_{dc}$ and $V_{dd} = V_t$, then

$$I_{on}/I_{off} = l_d/a = e^{V_t/(nV_T)}$$
$$V_t = nV_T \ln(I_{on}/I_{off})$$

Figs  7 and  8 show  energy  vs  Vdd.  Table  1 lists  the  voltages  and  energies  at  the  global 
minima. 
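
The balance condition $P_{ac} = P_{dc}$ and the resulting threshold voltage can be checked numerically. The sketch below takes the relations $P_{ac} = aCV^2f$, $P_{dc} = I_{off}V$ and $I_{on} = l_dCVf$ at face value; the activity ratio of 0.1 used in the example and the default $n = 1$ are illustrative assumptions, not values from the paper.

```python
import math

V_THERMAL = 0.026  # thermal voltage V_T in volts


def optimum_vt(logic_depth, activity, n=1.0):
    """Threshold voltage at which I_on/I_off equals logic_depth/activity."""
    return n * V_THERMAL * math.log(logic_depth / activity)


def power_split(logic_depth, activity, c, v, f):
    """Return (P_ac, P_dc) when I_on/I_off is set to logic_depth/activity."""
    i_on = logic_depth * c * v * f          # current needed to meet the cycle time
    i_off = i_on * activity / logic_depth   # leakage at the chosen on/off ratio
    p_ac = activity * c * v ** 2 * f
    p_dc = i_off * v
    return p_ac, p_dc
```

For a logic depth of 11 and the assumed activity ratio of 0.1, `optimum_vt(11, 0.1)` is a little over 120 mV, and `power_split` returns equal AC and DC power for any capacitance, voltage and frequency, which is the balance the derivation relies on.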
7 Minimum Energy x Time

Figure 9: 1/(energy x time) vs supply and threshold voltage.

The minimum energy solution is quite slow. Performance should improve dramatically in
deep submicron and with low voltage process optimizations. An alternative approach is to
minimize energy x time. If we assume transistors operate mostly in saturation, then

$$Et = V^2 Q/I \propto V^3/(V - V_t)^2$$
$$Et_{min}:\ \frac{d(Et)}{dV} = 0 \Rightarrow V = 3V_t$$

Fig 9 shows a maximum at $3V_t$ which grows much more pronounced at low voltage.
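
The location of the energy x time minimum can be confirmed with a coarse numerical scan of $V^3/(V - V_t)^2$. The sketch assumes only that proportionality; the function names and the grid resolution are arbitrary choices for illustration.

```python
def energy_time(v, v_t):
    """Energy x time up to a constant factor, valid for v > v_t."""
    return v ** 3 / (v - v_t) ** 2


def argmin_supply(v_t, steps=9000):
    """Scan supply voltages from just above v_t up to 10 * v_t and return
    the grid point that minimizes energy x time."""
    candidates = [v_t * (1.0 + i / 1000.0) for i in range(1, steps + 1)]
    return min(candidates, key=lambda v: energy_time(v, v_t))
```

With `v_t = 0.2` the scan returns a supply voltage of about 0.6 V, that is, $3V_t$, in agreement with setting the derivative to zero: $3(V - V_t) - 2V = 0$.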

8 Circuit  Design  Constraints 

A number of interesting circuit design constraints appear when leakage currents are large,
and when the dependence of current on voltage is exponential. Three constraints we have
observed to date:

• Dynamic circuits are difficult to manage. A minimum size transistor will have a
leakage current of about 1nA at Vt = 160mV. A dynamic storage node with 100fF
of capacitance will hold 50fC of charge at Vdd = 0.5V. A change of 100mV requires
movement of 10fC. 10fC/1nA = 10usec.

• Exponential dependence of current on voltage makes pass transistor logic difficult to
use. nfets cannot pass ones and pfets cannot pass zeros. In particular, using nfets as
access transistors for static latches does not work.
parameter     negative                                positive
reduce Xj     increase Rs, Rd                         decrease cjsw, cgso, cgdo
reduce Tox    decrease Vgs,max (gate-src breakdown)   increase k
              increase Cox (increase energy)          decrease n
reduce Nb     decrease Vds,max (punchthrough)         increase μ0
                                                      decrease cj, cjsw, n
reduce NG     increase RG                             decrease Vt
reduce ND     increase Rs, Rd                         decrease cj, cjsw

Table 2: Process optimization opportunities.

• Fully  static  logic  appears  to  work  well.  Transmission  gate  latches  work  nicely.  SRAM 
seems  to  work  well,  since  one  of  the  bitlines  will  be  pulling  down  on  a write. 
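
The hold-time estimate for a dynamic node in the first constraint follows directly from Q = CV and t = Q/I. A minimal sketch, with hypothetical function names and the leakage treated as a constant current:

```python
def stored_charge(capacitance_f, voltage_v):
    """Charge in coulombs held on a node: Q = C * V."""
    return capacitance_f * voltage_v


def hold_time(capacitance_f, delta_v, leakage_a):
    """Time in seconds for a constant leakage current to move the node by delta_v."""
    return stored_charge(capacitance_f, delta_v) / leakage_a
```

`stored_charge(100e-15, 0.5)` gives 50 fC, and `hold_time(100e-15, 0.1, 1e-9)` gives about 10 microseconds, matching the figures quoted above. A node that must be refreshed or re-evaluated that often is hard to use at low clock rates.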

9 Process  Optimization 

The  opportunity  exists  to  improve  performance  by  optimizing  fabrication  processes  for 
low  voltage  operation.  Carrier  mobility  degrades  significantly  in  submicron  processes  as 
channel  doping  is  increased  to  prevent  punchthrough  in  the  presence  of  strong  electric 
fields.  Reduced  voltage  operation  results  in  weaker  fields,  permitting  lower  channel  doping 
which  results  in  higher  carrier  mobility  and  increased  transconductance. 

Reduced  voltage  operation  also  permits  lower  diffusion  doping,  since  higher  diffusion 
resistance  will  not  impact  circuit  performance  due  to  reduced  transistor  drain  current. 
This  reduces  diffusion  capacitance  to  a negligible  fraction  of  gate  capacitance.  The  only 
drawback  of  reducing  diffusion  doping  is  that  lateral  diffusion  is  reduced,  increasing  the 
effective  channel  length.  This  is  partially  offset  by  the  reduced  Miller  effect  since  the  gate- 
drain  overlap  capacitance  is  reduced.  Table  2 summarizes  the  impact  of  various  process 
modifications  on  energy  and  performance. 

While a lower bound of 60mV/decade is achievable at room temperature ($dV =
nV_T\ln(10)$ with $n = 1$), $dV$ is more typically 80mV/decade in 2µ CMOS and 90mV/decade
in 0.8µ CMOS. $T_{ox}/d_0$ can be reduced by reducing $N_b$, since $d_0 = \sqrt{2\epsilon_{si}\phi_s/(qN_b)}$, where
$\phi_s = V_T\ln(N_b/n_i)$ and $n_i$ = … x $10^{16}$ [5].

Low  gate,  drain,  and  threshold  voltages  permit  all  doping  concentrations  to  be  reduced, 
once  again  due  to  lower  electric  field  strength.  This  has  two  benefits  for  low  voltage 
operation: 

1. n is reduced, decreasing the subthreshold slope and thus reducing the supply voltage
(and therefore energy per operation) necessary to achieve the desired on/off current
ratio.

2. source/drain capacitances are reduced, further reducing energy per operation.
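
The link between n, the subthreshold slope, and the voltage needed for a given on/off current ratio can be made concrete with the relation $dV = nV_T\ln(10)$ quoted earlier in this section. The sketch below is illustrative only: the function names are hypothetical, and the on/off ratio of 110 used in the usage note is an arbitrary example value.

```python
import math

V_THERMAL = 0.026  # thermal voltage V_T in volts


def subthreshold_swing(n):
    """Gate voltage change in volts per decade of current."""
    return n * V_THERMAL * math.log(10)


def n_from_swing(swing_v_per_decade):
    """Coefficient n implied by a measured swing."""
    return swing_v_per_decade / (V_THERMAL * math.log(10))


def voltage_for_ratio(on_off_ratio, n):
    """Voltage swing in volts needed to reach a given I_on/I_off ratio."""
    return n * V_THERMAL * math.log(on_off_ratio)
```

`subthreshold_swing(1.0)` is about 60 mV per decade, while swings of 80 mV and 90 mV per decade correspond to n of roughly 1.34 and 1.50. For an on/off ratio of 110, `voltage_for_ratio` grows in the same proportion, which is why reducing n lowers the supply voltage required.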

10  Barriers  to  Practical  Implementation 

A number  of  practical  considerations  place  a lower  bound  on  supply  voltage.  These  are: 
external  interfacing,  controlling  device  thresholds,  maintaining  adequate  noise  margins, 
power  supply  design,  power  consumption  of  OFF  devices,  and  circuit  speed.  Multichip 
module  packaging  provides  the  opportunity  to  isolate  low-voltage  subsystems  from  other 
system  components.  Limits  to  low  voltage  operation  may  be  determined  to  a large  extent 
by  the  power  dissipation  in  level-shifting  interface  circuits.  Device  thresholds  have  been 
observed  to  vary  with  transistor  geometry  and  even  location  on  a chip  [3]. 

A 10  watt  power  supply  will  have  to  deliver  20amps  at  Vdd  = 500mV. 

11  CIS  Test  chip 

In the BiCMOS process at Stanford’s Center for Integrated Systems, pfet gates are doped
p+ and nfet gates are doped n+. This means that if the channel implant is excluded,
both  devices  have  thresholds  close  to  zero  volts.  Vt  can  then  be  adjusted  by  adjusting 
the  substrate  bias  voltage.  We  have  implemented  a test  chip  which  contains  a number  of 
simple  circuit  structures  (see  Fig  10),  and  will  hopefully  have  some  results  in  time  for  the 
conference.  The  chip  has  the  following  characteristics: 

• Pfet gates doped p+ have Vt ~ 0V

• Independent  substrate  and  well  biases 

• self-testing  convolutional  coder 

• ring  oscillator 

• VCO 

• single  nfet,  pfet,  nand,  latch 
Figure 10: Ultra Low Power test chip. Separate bias voltages together with zero-Vt pfets
permit threshold adjustment.

12  Conclusions 

Submicron  CMOS,  together  with  3D  stacked  multichip  modules,  and  massively  parallel 
machines  demand  new  approaches  to  power  dissipation.  We  are  in  the  very  early  stages  of 
investigating reducing energy by reducing supply and threshold voltages. We are hopeful
that  low  voltage  CMOS  can  find  widespread  use  in  performance  driven,  power  constrained 
systems. 
Parallel Optimization Algorithms
and Their Implementation in VLSI Design

G.  Lee  and  J.  J.  Feeley 
Department  of  Electrical  Engineering 
University  of  Idaho 
Moscow,  ID  83843 

Abstract - Two new parallel optimization algorithms based on the simplex method
are described. They may be executed by a SIMD parallel processor
architecture and be implemented in VLSI design. Several VLSI design implementations
are introduced. An application example is reported to demonstrate that the
algorithms are effective.

1 Introduction 

Optimal  system  control  is  an  important  part  of  modern  control  theory.  The  kernel  problem 
is  optimizing  the  behavior  of  systems,  as  in  minimizing  the  energy  or  cost  required  to 
accurately  reach  some  required  terminal  state.  The  search  for  the  control  which  attains  the 
desired  objective  while  minimizing  (or  maximizing)  a defined  system  criterion  constitutes 
the  fundamental  problem  of  optimal  control  [1]  [2]  [3] . 

To date, practical applications of optimal control theory are still quite few in number.
For a class of systems with fast response, the implementation of a real-time on-line optimal
controller has been difficult. The time-consuming computation required for optimal
control solutions has been a major obstacle. Modern supercomputers with parallel processing
architectures and very fast computation speed are not a practical solution because of their
weight, size and cost. Fast computation, small size and low cost are basic requirements
for the controller. In this paper, the technique of an algorithmically specialized computer
is suggested to achieve an optimal controller which can realize both real-time
computation and on-line control for a rapidly responding system. Effective algorithms, parallel
architecture, and VLSI implementation are involved in the design of the controller.

Efficient optimization algorithms are essential for solving the two-point boundary-value
(TPBV) problems which arise in optimal control. Chazan and Miranker in 1970 [4]
originally proposed a nongradient-based parallel search algorithm for unconstrained
minimization which is suitable for execution using an array of parallel processors. The algorithm
involves the parallel execution of n linear searches along the same direction, starting from
n points, when the dimension of the vector of unknowns is n. Travassos and Kaufman [5]
have applied the algorithm to the solution of optimal control problems. Housos and Wing
in 1984 [6] reported a parallel pseudo-conjugate direction algorithm that performs a set of
n linear searches in parallel along different search directions. Those parallel optimization
algorithms proceed by univariate optimization, so they are MIMD-type algorithms
[7]. Although they may be used to solve the optimal control problem, it is not easy to
transfer them to a VLSI design for a small, low-cost controller. Two new parallel,
nongradient-based algorithms for unconstrained optimization are presented in this paper.
In contrast to existing parallel optimization algorithms, the new parallel algorithms are
based on a simplex method and are SIMD-type algorithms [7]. The advantage of the new
algorithms is that they do not need a linear search and may be easily transferred to a VLSI
implementation.

Three  kinds  of  design  schemes:  digitally  controlled  analog,  hybrid,  and  pure  digital,  are 
presented  in  this  paper.  Their  VLSI  implementation  and  their  performances  are  discussed. 


2 New  parallel  optimization  algorithms 

The following unconstrained minimization problem is considered:

$$\min f(X), \quad X \in \mathbb{R}^n,$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is usually non-quadratic and nonlinear.

We wish to find, numerically, a point $X^*$ such that, for some $\epsilon > 0$,

$$f(X^*) \le f(X) \quad \text{for all } X : \|X - X^*\| < \epsilon.$$

Two parallel simplex algorithms, PS1 and PS2, which are based on an improved simplex
method [8] [9] and use parallel function evaluations, are stated below.

Algorithm PS1: The algorithm PS1 predicts four candidate vertices simultaneously
in one iteration. Therefore at least four parallel processors are required. Each iteration
includes two phases: the first is for parallel evaluations and the second is for choosing a new
vertex to generate a new simplex via function value comparison. The computation time for
the function evaluations is always longer than the time for the function value comparison.
Executing the function evaluations in parallel therefore reduces computation time effectively,
since they account for the major part of the time for one iteration cycle. It is also important
that the parallel function evaluations are of the SIMD type. This allows the algorithm to run
on a SIMD parallel architecture. The number of parallel function evaluations required by PS1
is only about half the number required by the improved simplex algorithm of Nelder and
Mead [8].

The algorithm PS1 is described below:

(0) Initial simplex:

(0a) Set the iteration number $k = 0$.

(0b) A starting point $v^0 = (x_1^0, \ldots, x_n^0)$ is given. An initial simplex
$V^0 = [v_1^0, \ldots, v_{n+1}^0]$ is formed in parallel by $v_1^0 = (1 - \delta) v^0$ and

$$v_{i+1}^0 = \begin{cases} v^0 + \delta E_i x_i^0, & x_i^0 \neq 0 \\ v^0 + \delta E_i, & \text{otherwise} \end{cases}$$

where $E_i = \{0 \ldots 0\ 1\ 0 \ldots 0\}$ has its single 1 in position $i$, $i = 1, 2, \ldots, n$, and $\delta = 0.1$.

(0c) Parallel evaluation of the function value at the vertices of $V^0$:

$$Y^0 = [f(v_1^0), f(v_2^0), \ldots, f(v_{n+1}^0)]$$


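(1) Parallel sorting:

Set $k = k + 1$. Let $S^k = [X_1^k, X_2^k, \ldots, X_{n+1}^k]$ and $F^k = [f_1^k, f_2^k, \ldots, f_{n+1}^k]$ be the ordered
$V^{k-1}$ and $Y^{k-1}$.

Find $d = \max_i \|X_i^k - X_1^k\|$, $i = 2, 3, \ldots, n+1$. If $d$ is small enough, then stop;
otherwise continue as follows.

Denote $X_1^k$ by $X_l$, $X_n^k$ by $X_{sh}$ and $X_{n+1}^k$ by $X_h$, and likewise
$f_1^k$ by $f_l$, $f_n^k$ by $f_{sh}$ and $f_{n+1}^k$ by $f_h$.

The centroid $\bar{X}$ is the mean of the vertices with $i \neq n+1$, i.e.,

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i^k$$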


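(2) Parallel computation:

$$\begin{aligned}
X_c &= (1 - \beta)\bar{X} + \beta X_h \\
X_a &= (1 + \beta)\bar{X} - \beta X_h \\
X_r &= (1 + \alpha)\bar{X} - \alpha X_h \\
X_e &= (1 + \gamma)\bar{X} - \gamma X_h
\end{aligned}$$

Parallel function evaluations:

$f_c = f(X_c)$, $f_a = f(X_a)$, $f_r = f(X_r)$ and $f_e = f(X_e)$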

(3) Comparison and selection of new point for updating simplex:

(3a) If $f_e < f_r < f_h$, then $X_h = X_e$ and $f_h = f_e$.

(3b) If $f_{sh} > f_r > f_l$, or $f_e > f_r$ and $f_r < f_l$, then $X_h = X_r$ and $f_h = f_r$.

(3c) If $f_h > f_r > f_{sh}$ and $f_a < f_{sh}$, then $X_h = X_a$ and $f_h = f_a$.

(3d) If $f_r > f_h$ and $f_c < f_{sh}$, then $X_h = X_c$ and $f_h = f_c$.

(3e) If $f_r > f_h$ and $f_c > f_{sh}$, or if $f_h > f_r > f_{sh}$ and $f_a > f_{sh}$, do shrinkage in parallel:
$X_s = [X_l, X_{sj}]$, where $X_{sj} = (X_j + X_l)/2$, $j = 2, 3, \ldots, n+1$; evaluate and
update $F^k = [f_l, f(X_{s2}), f(X_{s3}), \ldots, f(X_{s(n+1)})]$ and $S^k = X_s$, then do (4).

(4) Update the simplex:

let $V^k = S^k$, $Y^k = F^k$, then return to (1).
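
The steps of PS1 can be sketched in software as follows. This is only an illustrative sketch, not the authors' implementation: the coefficient values `ALPHA`, `BETA` and `GAMMA` are assumed (the text fixes only $\delta = 0.1$), the order of the tests in step (3) follows the usual Nelder-Mead ordering, which is one possible reading of the selection rules, and a thread pool stands in for the four SIMD processors. The function name `ps1_minimize` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import math

# Assumed coefficient values; the source fixes only delta = 0.1.
ALPHA, BETA, GAMMA, DELTA = 1.0, 0.5, 2.0, 0.1


def initial_simplex(v0):
    """Step (0b): build n+1 vertices around the starting point v0."""
    verts = [[(1.0 - DELTA) * x for x in v0]]
    for i, xi in enumerate(v0):
        v = list(v0)
        v[i] += DELTA * xi if xi != 0 else DELTA
        verts.append(v)
    return verts


def blend(a, p, b, q):
    """Return a*p + b*q component-wise."""
    return [a * u + b * w for u, w in zip(p, q)]


def ps1_minimize(f, v0, tol=1e-6, max_iter=1000):
    """Return (best point, best value, number of parallel evaluation phases)."""
    n = len(v0)
    with ThreadPoolExecutor(max_workers=4) as pool:
        X = initial_simplex(v0)
        F = list(pool.map(f, X))
        pfe = 1
        for _ in range(max_iter):
            # Step (1): sort the vertices and test the simplex size.
            order = sorted(range(n + 1), key=lambda i: F[i])
            X = [X[i] for i in order]
            F = [F[i] for i in order]
            if max(math.dist(X[i], X[0]) for i in range(1, n + 1)) < tol:
                break
            f_l, f_sh, f_h, x_h = F[0], F[n - 1], F[n], X[n]
            xbar = [sum(col) / n for col in zip(*X[:n])]
            # Step (2): four candidate vertices, evaluated concurrently.
            x_c = blend(1 - BETA, xbar, BETA, x_h)
            x_a = blend(1 + BETA, xbar, -BETA, x_h)
            x_r = blend(1 + ALPHA, xbar, -ALPHA, x_h)
            x_e = blend(1 + GAMMA, xbar, -GAMMA, x_h)
            f_c, f_a, f_r, f_e = pool.map(f, [x_c, x_a, x_r, x_e])
            pfe += 1
            # Step (3): choose the replacement for the worst vertex (assumed ordering).
            if f_r < f_l:
                new = (x_e, f_e) if f_e < f_r else (x_r, f_r)
            elif f_r < f_sh:
                new = (x_r, f_r)
            elif f_r < f_h and f_a < f_sh:
                new = (x_a, f_a)
            elif f_r >= f_h and f_c < f_sh:
                new = (x_c, f_c)
            else:
                new = None
            if new is not None:
                X[n], F[n] = new
            else:
                # Step (3e): shrink every vertex halfway toward the best one.
                X = [X[0]] + [blend(0.5, X[j], 0.5, X[0]) for j in range(1, n + 1)]
                F = [F[0]] + list(pool.map(f, X[1:]))
                pfe += 1
    best = min(range(n + 1), key=lambda i: F[i])
    return X[best], F[best], pfe
```

For example, `ps1_minimize(lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2, [3.0, 3.0])` should return a point close to (1, -2). The returned count tracks parallel evaluation phases rather than individual function calls, which is the quantity the text compares against the sequential simplex method.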
Algorithm PS2: The algorithm PS2 is developed from the algorithm PS1 by adding
sixteen parallel processors. Twenty processors in total are utilized to predict
twenty candidate vertices simultaneously in one iteration. The algorithm PS2 is more
effective than the algorithm PS1. One iteration of the algorithm PS2 is functionally equivalent
to two iterations of the algorithm PS1. Thus the algorithm PS2 will perform the same work
in roughly half the time of the algorithm PS1. Algorithm PS2 is also of the SIMD type.
The algorithm PS2 is described below:

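(0) The same as Step (0) of Algorithm PS1;

(1) The same as Step (1) of Algorithm PS1;

(2) Parallel computation:

(2a) Compute the first level direction points (four in total) in parallel:

$$\begin{aligned}
X_c &= (1 - \beta)\bar{X} + \beta X_h \\
X_a &= (1 + \beta)\bar{X} - \beta X_h \\
X_r &= (1 + \alpha)\bar{X} - \alpha X_h \\
X_e &= (1 + \gamma)\bar{X} - \gamma X_h
\end{aligned}$$

and find 4 conductive points in parallel:

$$\bar{X}_i = \frac{1}{n}\Big(\sum_{j=1}^{n-1} X_j + X_i\Big), \quad i = \text{'c'}, \text{'a'}, \text{'r'}, \text{'e'}$$

(2b) Compute the second level direction points (sixteen in total) in parallel:

$$\begin{aligned}
X_{ic} &= (1 - \beta)\bar{X}_i + \beta X_{sh} \\
X_{ia} &= (1 + \beta)\bar{X}_i - \beta X_{sh} \\
X_{ir} &= (1 + \alpha)\bar{X}_i - \alpha X_{sh} \\
X_{ie} &= (1 + \gamma)\bar{X}_i - \gamma X_{sh}
\end{aligned}$$

where $i$ = 'c', 'a', 'r', 'e'.

(2c) Parallel function evaluations:

$f_i = f(X_i)$,

$f_{ic} = f(X_{ic})$, $f_{ia} = f(X_{ia})$, $f_{ir} = f(X_{ir})$ and $f_{ie} = f(X_{ie})$,

where $i$ = 'c', 'a', 'r', 'e'.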

(3) Comparison and updating simplex:

Set $m = 0$.

(3a) If $f_e < f_r < f_h$, $j$ = 'e', or
if $f_{sh} > f_r > f_l$, or $f_e > f_r$ and $f_r < f_l$, $j$ = 'r', or
if $f_h > f_r > f_{sh}$ and $f_a < f_{sh}$, $j$ = 'a', or
if $f_r > f_h$ and $f_c < f_{sh}$, $j$ = 'c', set $m = m + 1$ and do (3b);

if $f_r > f_h$ and $f_c > f_{sh}$, or if $f_h > f_r > f_{sh}$ and $f_a > f_{sh}$, do shrinkage in
parallel: $X_s = \{X_l, X_{sj}\}$, where $X_{sj} = (X_j + X_l)/2$, $j = 2, 3, \ldots, n+1$; evaluate
$F^k = \{f_l, f(X_{s2}), f(X_{s3}), \ldots, f(X_{s(n+1)})\}$ and let $S^k = X_s$, then do (4).

(3b) Replacing:

$X_h = X_{sh}$, $f_h = f_{sh}$; $X_{sh} = X_j$, $f_{sh} = f_j$; and $X_i = X_{ji}$, $f_i = f_{ji}$.

If $m = 1$ do (3a), otherwise do (4).

(4) The same as Step (4) of Algorithm PS1.

3 Example  of  application  to  real-time  optimal  control 

The air-to-air missile-target intercept is a practical real-time optimal control application.
A typical intercept mission, from missile launch to intercept, may take only a few seconds.
It is almost impossible to achieve true optimal control during such a short time interval
with present technology. The efficiency of the algorithm PS2 for real-time application of
optimal control is demonstrated in this section via simulation of a 3-dimensional air-to-air
missile-target intercept problem. An optimal guidance law that minimizes missile energy
expenditure with fixed final time $t_f$ and fixed final state (zero miss range) is derived in
Ref. [9] using nonlinear optimal control theory. This section focuses on solving the
nonlinear TPBV problem (NTPBVP) which arises in the intercept problem by the "shooting
method" using Algorithm PS2.


Figure 1 shows the 3-dimensional intercept scenario. The target T moves in a straight
line at constant velocity $V_T$, and the missile M moves with controlled acceleration $a(t)$ in
the direction given by the angles $\alpha(t)$ and $\beta(t)$. An on-board optimal controller in the
missile calculates $a(t)$, $\alpha(t)$ and $\beta(t)$ and provides them to the missile thruster.

The NTPBVP obtained is

$$\begin{aligned}
\dot{x}_1 &= x_4, & x_1(0) &= x_{10}, & x_1(t_f) &= 0 \\
\dot{x}_2 &= x_5, & x_2(0) &= x_{20}, & x_2(t_f) &= 0 \\
\dot{x}_3 &= x_6, & x_3(0) &= x_{30}, & x_3(t_f) &= 0
\end{aligned}$$

X4 

a(x|0  -f*  x,,)xio 

054(0)  = x40 

is 

= 

-a(xj0  + xjjxu 

05s(0)  = X50 

ie 

= 

— OXj2 

*s(0)  = x60 

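$$\begin{aligned}
\dot{x}_7 &= 0 \\
\dot{x}_8 &= 0 \\
\dot{x}_9 &= 0 \\
\dot{x}_{10} &= x_7, & x_{10}(t_f) &= 0 \\
\dot{x}_{11} &= x_8, & x_{11}(t_f) &= 0 \\
\dot{x}_{12} &= x_9, & x_{12}(t_f) &= 0
\end{aligned}$$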

where  - — - - — ..  ’ ....  . . 

a — (*io  d*  zii)1  d"  ®?2"  

Notice that the initial conditions $x_{10}$ to $x_{60}$ are constant and the terminal values $x_1(t_f)$
to $x_3(t_f)$ and $x_{10}(t_f)$ to $x_{12}(t_f)$ are zero. The first six equations of (1) are the dynamics
of the system. The second six equations are the co-state equations. The shooting method
starts by estimating a set of initial values $(x_7(0)\ x_8(0)\ x_9(0))^T$, then integrates (1) forward
with the given and estimated initial values $x_1(0)$ to $x_{12}(0)$. The resulting terminal values are
usually different from the given ones. An error function $E$ is defined by

$$E = \sqrt{x_1(t_f)^2 + x_2(t_f)^2 + x_3(t_f)^2 + x_{10}(t_f)^2 + x_{11}(t_f)^2 + x_{12}(t_f)^2}. \tag{2}$$

The shooting method attempts to minimize the error function $E$:

$$\min E \to 0 \tag{3}$$

This can be done by using the algorithm PS2 to update the estimated initial values
until (3) is satisfied.
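
The shooting procedure can be illustrated on a much smaller problem. The sketch below is not the 12-state intercept system: it is an assumed one-dimensional minimum-energy toy with position `x`, velocity `v`, a constant position co-state `p` and a velocity co-state `q`, with dynamics x' = v, v' = -q, p' = 0, q' = -p. The values `X0`, `V0`, `TF` and all function names are hypothetical, and a simple one-dimensional step-halving search stands in for Algorithm PS2.

```python
import math

TF = 1.0            # fixed final time (assumed toy value)
X0, V0 = 1.0, 0.2   # given initial position and velocity (assumed toy values)


def deriv(s):
    """Toy dynamics and co-state equations for the state (x, v, p, q)."""
    x, v, p, q = s
    return (v, -q, 0.0, -p)


def rk4_step(s, h):
    """One classical Runge-Kutta step of size h."""
    k1 = deriv(s)
    k2 = deriv(tuple(a + 0.5 * h * b for a, b in zip(s, k1)))
    k3 = deriv(tuple(a + 0.5 * h * b for a, b in zip(s, k2)))
    k4 = deriv(tuple(a + h * b for a, b in zip(s, k3)))
    return tuple(a + h / 6.0 * (b + 2 * c + 2 * d + e)
                 for a, b, c, d, e in zip(s, k1, k2, k3, k4))


def miss(p0, steps=100):
    """Error function E: integrate forward from the guessed co-state p0."""
    s = (X0, V0, p0, p0 * TF)   # q(0) tied to p so that q(TF) = 0
    h = TF / steps
    for _ in range(steps):
        s = rk4_step(s, h)
    x_tf, _, _, q_tf = s
    return math.hypot(x_tf, q_tf)


def shoot(p0=1.0, step=1.0, tol=1e-9):
    """Adjust the guessed co-state until the terminal error stops improving."""
    e = miss(p0)
    while step > tol:
        for cand in (p0 + step, p0 - step):
            e_cand = miss(cand)
            if e_cand < e:
                p0, e = cand, e_cand
                break
        else:
            step *= 0.5   # neither neighbour improves: refine the step
    return p0, e
```

For this toy problem the terminal position is `X0 + V0*TF - p*TF**3/3`, so the search should settle near p = 3.6 with a terminal error close to zero. The structure mirrors the text: guess the unknown initial co-state, integrate forward, measure the terminal miss, and let a derivative-free search drive it toward zero.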

The initial conditions, taken from Ref. [10], are

x10 =  20000  (ft)
x20 =   3000  (ft)
x30 =   2500  (ft)
x40 =   -972  (ft/sec)
x50 =   -972  (ft/sec)
x60 =      0  (ft/sec)

and the fixed final time is $t_f = 5$ (sec).

Assuming the target velocity $V_T$ is constant, the target travels in a straight
line. The target path may be calculated by

$$x_T(t) = x_{0T} + V_T t, \qquad y_T(t) = y_{0T}, \qquad z_T(t) = z_{0T},$$

where $[x_{0T}\ y_{0T}\ z_{0T}]^T$ is the target's initial position, so that open-loop optimal control
may be employed by the missile.

To obtain the open-loop optimal control numerically, one must first solve the NTPBVP
(1). For a set of rough initial estimates

x70 = 1
x80 = 1
x90 = 1

using the algorithm PS2, the resulting solutions are given in Table 1:


TI (sec)   OIV                  PFE   CMR (ft)   RMR (ft)
[0 5]      x7(0) =  0.65993     114   0          1.287e-9
           x8(0) = -0.08107
           x9(0) =  0.92175

Table 1: The numerical results of the intercept scenario

TI  - Time interval,
OIV - Optimal initial values,
PFE - Parallel function evaluations,
CMR - Constraint of miss range,
RMR - Real miss range.

As a rough estimate of computation time, if fewer than 120 parallel function evaluations
can be completed in five seconds (Table 1 shows that 114 were needed), which is about 42 ms
per evaluation, then real-time optimal control can be implemented for this air-to-air
missile-target intercept problem. This is well within reach of modern VLSI techniques. In the
next section several VLSI design possibilities are introduced.

4 VLSI  implementations 

This section presents design possibilities for potential real-time, on-line, optimal
controllers. These optimal controllers will be algorithmically specialized parallel computers
consisting of a few VLSI chips. Small special-purpose optimal controllers should be useful
for certain optimal control systems, such as aircraft control, missile guidance, etc.

To conserve space, only the algorithm PS1 is considered for VLSI implementation in
this section. The design procedure can be used for algorithm PS2, but the resulting circuit
will be more complex.

4.1  Schematic  design 

A schematic diagram of the implementation of the algorithm PS1 is shown in Figure 2.
The dashed box performs the main function of the algorithm PS1. In order to be useful for
various control systems, a parallel function evaluator (PFE) is separated from the dashed
box. The PFE is an array of four parallel processors (twenty for the algorithm PS2). The
complete system includes two separate parts: the main algorithm part and the PFE. The main
part is the algorithm itself, in which the design is fixed. The PFE is more flexible and
differs from system to system.


Figure 2: Block diagram of the algorithm PS1

The operation of the system outlined in Figure 2 may be described as follows.

The IS, connected to the input $X_0$, generates an initial simplex $[X_{01}, \ldots, X_{0(n+1)}]$.
Via the multiplexer (Mul), the function values on the initial simplex may be
evaluated by the external PFE. The outputs of the PFE, $[f(X_1), \ldots, f(X_{n+1})]$, are saved in
the simplex memories (SMe) via the demultiplexer (DMul) and a set of updating switches (USw).
The Mul and the DMul also pass the initial simplex vertices, denoted
as $[X_1, \ldots, X_{n+1}]$, to the SMe. A basic simplex with its function values is then stored for
further operations.

According to the algorithm PS1, the stored simplex must be updated. To do this, the
simplex in the SMe must first be sorted by a sorting circuit (Sorting). A sorted simplex,
$[X_l, \ldots, X_{sh}, X_h]$ with function values $[f_l, \ldots, f_{sh}, f_h]$, is available at the output of the
Sorting. From it four direction points, $X_c$, $X_a$, $X_r$ and $X_e$, can be found in the direction
points module (DP). As with the initial simplex, these points and their function values $f_c$, $f_a$, $f_r$
and $f_e$, evaluated by the PFE, are stored in the direction memories (DMe) via the Mul and
the DMul. A new point module (NP) compares $[f_c, f_a, f_r, f_e]$ with $[f_l, f_{sh}, f_h]$ and
then selects the proper direction point, denoted by $X_h'$, with its function value
$f_h'$. Via the USw, $X_h'$ replaces the vertex $X_h$ to update the basic simplex. The positions
of $X_h$ and $f_h$ are indexed by one of the signals $g_1$ to $g_{n+1}$ generated by the Sorting.

In case no new point can be selected, the NP will send out a digital signal Ds. In response,
the DTC generates another control signal Cs to the Mul and the DMul; a shrunken
simplex $[X_l, X_{s1}, \ldots, X_{sn}]$ from the simplex shrinkage module (SSh), with its function values,
is then passed to the SMe, so that the basic simplex is updated.

The simplex size module (SSi) and the convergence testing module (CT) monitor the
size of the sorted simplex and its minimal function value, working together with the size
switches (SSw) and the Sorting. When either quantity satisfies a given criterion, the CT
sends a "stop" signal to finish the iterations.

The  digital  timing  controller  (DTC)  is  necessary  to  control  the  timing  of  the  whole 
system.  The  functions  of  the  DTC  may  be  stated  by  defining  its  inputs  and  outputs  as 
follows: 


Inputs:

Start:   actuates the DTC and starts the computation,
T:       a parameter given for setting up the width of the Ce's active interval,
Repeat:  after the computation, reactuates the DTC and repeats the solutions if necessary,
Stop:    stops the iteration when the solutions are available,
Ds:      active when a shrinkage simplex is needed, sets up the Cs.

Outputs:

Ce:      actuates the PFE, the Mul and the DMul; its active length is given by the input T,
Ci:      passes the initial simplex and its function values to the SMe via the Mul and the DMul,
Cs:      passes the shrinkage simplex and its function values to the SMe via the Mul and the
         DMul; it is controlled by the input Ds,
Cd:      passes the direction points and their function values to the DMe via the Mul and the DMul,
Cr:      repeats the computation; it is controlled by the input "Repeat",
Cm:      actuates the SMe and the DMe,
Ct:      tests the simplex size, active at each iteration.


The  design  of  the  DTC  is  a normal  digital  logic  design  and  is  not  included  here. 
To meet various application requirements, three kinds of design schemes, (1) digitally
controlled analog, (2) hybrid, and (3) all-digital, are suggested here. The digitally
controlled analog design uses analog computation for both the PFE and the main algorithm
part. The hybrid design uses digital computation for the main algorithm. Finally the
digital design is a purely digital scheme. Due to their different characteristics, they are
employed in different circumstances as listed below.

For  the  digitally  controlled  analog  controller: 

1.  Accuracy  limited  but  faster  computation. 

2.  Limited  memory  period. 

3.  More  efficient  for  low  frequency  systems  with  shorter  operation  time. 

For  the  hybrid  controller: 

1.  Same  as  1 above. 

2.  Unlimited  memory  period. 

3.  More  efficient  for  low  frequency  systems  with  longer  operation  time. 

For  the  digital  controller: 

1.  Accuracy  unlimited  but  slow  computation. 

2.  Unlimited  memory  period. 

3.  No  strong  relation  to  frequency  and  time  of  system  operation. 

4.2  Digitally  controlled  analog  scheme 

In general, analog computation is faster than digital computation. This suggests the PFE
and the main algorithm part (not including the DTC) may be implemented by analog
techniques. However, long-term analog memory is not easily implemented on a VLSI chip.
The required memory time depends strongly on the problem's complexity. If the required
memory time is too long and the size requirement is critical, the hybrid computation
scheme described next should be considered.


4.3  Hybrid  scheme 

The hybrid scheme includes digital computation and memories in the main algorithm part,
but the PFE still uses analog techniques. In practice, the PFE is a parallel electronic
differential analyzer (EDA) which consists of some integrators. Integration
is more conveniently performed with analog circuits than with digital ones. Keeping the PFE
in analog will reduce computation time. A hybrid scheme is suggested in Figure 3. The system has
three sections: the PFE, the digital algorithm processor and the linkage system, in which
some analog/digital (A/D) and digital/analog (D/A) converters are essential.

Each numerical value in the digitally controlled analog scheme mentioned above, such
as each function value, each element value of a simplex vertex and each constant value,
will now be represented in m-bit form and be stored in an m-bit register. In order to achieve
parallel computation, the inputs and outputs of these registers are parallel m-bit data buses.
Furthermore, all of the digital devices in this system are parallel.
Figure 3: Hybrid scheme


4.4  Digital  scheme 

Based on the hybrid scheme, a purely digital optimal controller may be obtained by designing
a digital PFE. A key point is to design, for the PFE, a digital integrator, which is very
different from an analog one. The design of the digital PFE is related to both the solution
methods and the particular problem, and may be separated into two parts. The first
is an algorithmically specialized unit for a solving algorithm, such as the Runge-Kutta
algorithm, and the second is a computing unit for a given differential equation.
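
The two-part split of the digital PFE can be sketched in software as follows. This is a hedged illustration rather than a circuit description: `solver_unit` plays the role of the algorithmically specialized Runge-Kutta unit, `equation_unit` plays the role of the computing unit for one given differential equation (here the assumed toy equation y' = -y), and `quantize` mimics storing each result in a fixed-width register. The register width `FRAC_BITS` and all names are assumptions.

```python
import math

FRAC_BITS = 20  # assumed number of fractional bits kept in each register


def quantize(value):
    """Round a value to the nearest multiple of 2**-FRAC_BITS."""
    scale = 1 << FRAC_BITS
    return round(value * scale) / scale


def equation_unit(t, y):
    """Computing unit for one given differential equation: y' = -y (toy example)."""
    return [-yi for yi in y]


def solver_unit(rhs, t, y, h):
    """Algorithm unit: one classical fourth-order Runge-Kutta step."""
    k1 = rhs(t, y)
    k2 = rhs(t + h / 2, [a + h / 2 * b for a, b in zip(y, k1)])
    k3 = rhs(t + h / 2, [a + h / 2 * b for a, b in zip(y, k2)])
    k4 = rhs(t + h, [a + h * b for a, b in zip(y, k3)])
    step = [a + h / 6 * (p + 2 * q + 2 * r + s)
            for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
    return [quantize(v) for v in step]  # the result is stored back in a register


def integrate(rhs, y0, t_end, steps):
    """Drive the solver unit over [0, t_end] with a fixed step size."""
    t, y, h = 0.0, [quantize(v) for v in y0], t_end / steps
    for _ in range(steps):
        y = solver_unit(rhs, t, y, h)
        t += h
    return y
```

Calling `integrate(equation_unit, [1.0], 1.0, 100)` approximates exp(-1), with an error dominated by the register rounding rather than by the Runge-Kutta step. Swapping in a different `equation_unit` changes the problem without touching the solver, which is the separation the text describes.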

5 Conclusion 

Two  new  parallel  optimization  algorithms,  PS1  and  PS2,  based  on  the  simplex  method 
are  described.  Four  processors  are  required  for  PS1  and  twenty  for  PS2.  They  may  be 
executed  by  a SIMD  parallel  processor  architecture  and  may  be  easily  shifted  to  VLSI 
design. 

The  numerical  result  of  a 3-dimensional  air-to-air  missile-target  intercept  problem  has 
been  reported  to  demonstrate  that  the  algorithms  are  effective  and  the  real-time  optimal 
controllers  are  feasible  for  a class  of  optimal  control  systems  with  fast  response. 

As a design example, the algorithm PS1 has been transferred to a VLSI implementation.
Three types of controller design schemes have been presented: (1) digitally controlled
analog, (2) hybrid, and (3) pure digital controller. They can be employed satisfactorily for
different application requirements.

In general, the optimal controllers converge rather rapidly once an initial estimate
is found for which the error function E being minimized evaluates to
a number in the neighborhood of zero. However, if the problem to be solved is very
sensitive to small perturbations in the initial co-state vector, convergence to an optimal
solution may be slow, or may even fail. This case was not considered in this research. To
overcome this problem, a method [5] suggested by R. Travassos and H. Kaufman may be added
to the design of the optimal controllers. This approach is currently under consideration.
Abstract - There has recently been increased interest in logic synthesis using EXOR
gates. The paper introduces the fundamental concept of Orthogonal Expansion,
which generalizes the ring form of the Shannon expansion to the logic with
multiple-valued (mv) inputs. Based on this concept we are able to define a
family of canonical tree circuits. Such circuits can be considered for binary
and multiple-valued input cases. They can be multi-level (trees and DAGs)
or flattened to two-level AND-EXOR circuits. Input decoders similar to those
used in Sum of Products (SOP) PLAs are used in realizations of multiple-valued
input functions. In the case of the binary logic the family of flattened
AND-EXOR circuits includes several forms discussed by Davio and Green. For
the case of the logic with multiple-valued inputs, the family of the flattened
mv AND-EXOR circuits includes three expansions known from literature and
two new expansions.

1 Introduction 

Although the EXOR gate exists in most VLSI cell libraries, there are no logic synthesis
systems that find optimized multi-level circuits using EXORs. The recently developed PLD
devices, such as Programmable Gate Arrays (Xilinx LCA 3000) [33], Signetics LHS501 [32],
Actel [7] or others [13], either include EXOR gates or allow them to be realized in "universal
logic modules". Since the five-input EXOR gate in the Xilinx device has the same speed and
cost as, for instance, a five-input OR gate [5], new design methods are needed for
such technologies that treat EXOR gates on an equal footing with
the AND and OR gates. In particular, if a Reed-Muller [15,22] form has fewer terms than
a two-level AND-OR expression, this form should be used for the Xilinx realization, and not
the SOP expression, as is done nowadays.

The problem of finding the minimal generalized Reed-Muller (GRM) canonical form of
optimal polarity [14] (also called fixed-polarity Reed-Muller [9]), as well as the problem of
finding the minimal Exclusive Sum of Products (ESOP) of a Boolean function [2,10,27,28],
are classical ones in logic synthesis theory, but exact solutions to them have been
proposed only for small functions [16,17].

Solving the above two problems, and creating other new methods of multi-level EXOR
circuit design, is practically important for several reasons: (1) It has long been the
experience of logic designers that EXOR circuits can be more economical than the
conventional inclusive (AND-OR) normal circuits. This was also confirmed on
many practical examples, especially on arithmetic and telecommunication circuits [1,12].
It was also proven theoretically [26] on worst case and arithmetic functions. (2) The
structure of EXOR circuit implementations is especially suitable for VLSI, optical, and some
other recent technologies. The RM and GRM forms have superior design-for-test
properties [6,11,21,23], unmatched by other realizations. This was not exploited in the
past since the EXOR gate realizations were slow and area-expensive. With the arrival of
PGA devices this deficiency no longer holds and the theories developed for instance in
[6,11,21,23] should be put to practical use. (3) Currently, the widely used logic minimization
programs such as Espresso [3] and MIS II do not take EXOR gates into account in their
minimization processes, which often causes nonminimal results. There is a growing
industrial interest among the users of CAD logic synthesis tools in having a program that
would generate optimized circuits including EXOR gates [12], and such tools are starting to be
introduced to the CAE market (for instance by Mentor Graphics Inc.). (4) The new tools for
ESOP synthesis are either heuristic [2,10,18,27,28] or produce exact solutions for general
ESOPs [16,17,20], but the latter are so slow that they can be applied only to small functions.
For a few canonical forms included in ESOPs, optimal programs exist for functions of about 10
variables [29,30,31]. It is therefore important to construct programs that will be faster than
the current exact minimizers and still be able to produce quasi-minimal solutions.

1 This research was supported in part by the NSF Research Initiation Award for the first author.

A book by Davio [4] and a paper by Green [9] give information on the numbers and
properties of various canonical forms that are specializations of binary ESOPs, which may
be useful to create efficient algorithms for them. In [19] we presented a family of
multiple-valued input expansions. In this paper we will present a subset of the family from [19], but
we will present the material in a more complete and systematic way. We will introduce
new canonical binary and multiple-valued forms and expressions. Forms, Directed Acyclic
Graphs (DAGs), Trees and expressions obtained by the tree searching methods introduced
here will all be called expansions. The ultimate goal of the research reported here is
to create synthesis programs, exact and approximate, for all known and some new forms
that are subsets of binary and multiple-valued input ESOPs.

2 Binary  Generalizations  of  Reed-Muller  Forms 

A Reed-Muller (RM) expression (for binary logic) is an exclusive sum of products of
positive (non-complemented) input variables. A Negative Reed-Muller (NRM) expression is
an exclusive sum of products of negative (complemented) input variables. Both these
expansions are called Single Polarity Reed-Muller Forms.

Definition 2.1. The literal $\dot{x}_i$ is a variable in either positive ($x_i$) or
complemented ($\bar{x}_i$) form.

Let us consider the following form:

$$f(x_1, \ldots, x_n) = g_0 \oplus g_1 \dot{x}_1 \oplus \cdots \oplus g_n \dot{x}_n \oplus g_{n+1} \dot{x}_1 \dot{x}_2 \oplus \cdots \oplus g_{2^n - 1} \dot{x}_1 \dot{x}_2 \cdots \dot{x}_n \tag{1}$$

where $g_i = 0$ or $1$, and $\dot{x}_i = x_i$ or $\bar{x}_i$.


3rd  NASA  Symposium  on  VLSI  Design  1991 




Definition 2.2. By a Generalized Reed-Muller Form (GRM) one understands a form
(1) in which each variable can be complemented (negative) or not complemented (positive),
but cannot appear in both forms.

Such forms are canonical, which means that only one such form exists for every polarity
of variables (there are $2^n$ such polarities for a Boolean function of n binary inputs, which
means that there are $2^n$ corresponding GRM forms). Applying the principle of duality to
all presented forms one gets the dual forms: the system $(\oplus, \cdot)$ is replaced with the dual
system $(\odot, +)$. All results of this paper, after applying the principle of duality to them,
hold in the dual system as well. Let us observe that the circuits generated for both systems
can be implemented using EXOR and NOR gates or EXOR and NAND gates.

By "flattening" we understand applying the Boolean rule $a(b \oplus c) = ab \oplus ac$.
Flattening is used to convert trees and multi-level expressions to two-level expressions,
such as Reed-Muller forms, or ESOPs.

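The well-known Shannon expansion for the case of ESOP expansion is as follows:

$$f(x_1, \ldots, x_i, \ldots, x_n) = \bar{x}_i \cdot f(x_1, \ldots, x_i = 0, \ldots, x_n) \oplus x_i \cdot f(x_1, \ldots, x_i = 1, \ldots, x_n) \tag{2}$$

By applying the laws $\bar{a} = 1 \oplus a$ and $a = 1 \oplus \bar{a}$ one gets:

$$f(x_1, \ldots, x_i, \ldots, x_n) = f(x_1, \ldots, x_i = 0, \ldots, x_n) \oplus x_i \cdot [f(x_1, \ldots, x_i = 0, \ldots, x_n) \oplus f(x_1, \ldots, x_i = 1, \ldots, x_n)] \tag{3}$$

and

$$f(x_1, \ldots, x_i, \ldots, x_n) = f(x_1, \ldots, x_i = 1, \ldots, x_n) \oplus \bar{x}_i \cdot [f(x_1, \ldots, x_i = 0, \ldots, x_n) \oplus f(x_1, \ldots, x_i = 1, \ldots, x_n)] \tag{4}$$

In the short form:

$$f = \bar{x}_i f_{\bar{x}_i} \oplus x_i f_{x_i} \tag{5}$$

$$f = f_{\bar{x}_i} \oplus x_i \cdot [f_{x_i} \oplus f_{\bar{x}_i}] \tag{6}$$

$$f = f_{x_i} \oplus \bar{x}_i \cdot [f_{x_i} \oplus f_{\bar{x}_i}] \tag{7}$$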

Let us observe that these expansion formulas have been applied by several authors for the synthesis of GRM forms for completely specified functions [4]. Davio [4] and Green [9] use them as a basis of Kronecker Reed-Muller (KRM), Pseudo-Kronecker Reed-Muller (PKRM), and Quasi-Kronecker Reed-Muller (QKRM) forms (Green also uses trees for better explanation). If only rule 6 is used repeatedly for some fixed order of expansion variables, the RM Trees are created, which correspond to RM forms after their flattening. If for every variable one uses either rule 6 or rule 7, the GRM Trees are created, from which GRMs are obtained by flattening (which proves in another way why there are $2^n$ such forms). If for every variable one uses either rule 5, rule 6, or rule 7, the KRM Trees are created, from which KRMs are obtained by flattening (which proves in another way why there are $3^n$ such forms). If rules 5, 6 and 7 are used, but in each subtree there is a choice of a rule, the PKRM Trees are generated, from which PKRM forms are obtained by flattening. Now, if additionally we allow the expansion variables to have various orders (but the same in the entire tree), one obtains the QKRM Trees and the QKRM flattened forms, respectively. One can now see that a further natural generalization is to allow various orders of variables in subtrees of QKRM trees to create an even wider family of trees. There are two ways of generalizing those forms for the logic with multiple-valued inputs. One was shown in [19]. The other one will be presented here.
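To illustrate how repeated use of rules 6 and 7 leads to a GRM form, the following minimal Python sketch computes the coefficients $g_i$ of form (1) from a truth table, for a chosen polarity of each variable. The names `grm_coeffs` and `eval_grm`, and the truth-table ordering (first variable as the most significant index bit), are assumptions made for this sketch and are not taken from the paper.

```python
from itertools import product


def grm_coeffs(tt, polarity):
    """Coefficients of a fixed-polarity Reed-Muller form.

    tt       -- truth table as a list of 0/1 of length 2**n
                (index bits are x_1 ... x_n, x_1 most significant)
    polarity -- tuple of n entries: 1 = positive literal (rule 6),
                0 = complemented literal (rule 7)
    Returns a list of 2**n coefficients; bit k of the index (same bit
    order as tt) says whether the literal of variable k is in the term.
    """
    if len(tt) == 1:
        return list(tt)
    half = len(tt) // 2
    f0, f1 = tt[:half], tt[half:]              # cofactors for x_1 = 0 and x_1 = 1
    diff = [a ^ b for a, b in zip(f0, f1)]     # f0 xor f1, multiplied by the literal
    base = f0 if polarity[0] == 1 else f1      # rule 6 keeps f0, rule 7 keeps f1
    rest = polarity[1:]
    return grm_coeffs(base, rest) + grm_coeffs(diff, rest)


def eval_grm(coeffs, polarity, xs):
    """Evaluate the GRM form given by coeffs at the input point xs."""
    n = len(xs)
    result = 0
    for idx, g in enumerate(coeffs):
        if not g:
            continue
        term = 1
        for i in range(n):
            if (idx >> (n - 1 - i)) & 1:
                term &= xs[i] if polarity[i] else 1 - xs[i]
        result ^= term
    return result


if __name__ == "__main__":
    or2 = [0, 1, 1, 1]                          # x1 OR x2
    print(grm_coeffs(or2, (1, 1)))              # [0, 1, 1, 1]: x2 xor x1 xor x1 x2
    print(grm_coeffs(or2, (0, 0)))              # [1, 0, 0, 1]: 1 xor (not x1)(not x2)
    for pol in product((0, 1), repeat=2):       # all 2**n polarities reproduce the function
        c = grm_coeffs(or2, pol)
        assert all(eval_grm(c, pol, xs) == or2[k]
                   for k, xs in enumerate(product((0, 1), repeat=2)))
```

Each of the $2^n$ polarity vectors gives one coefficient list, which matches the count of GRM forms stated above.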


3 Generalizations of Reed-Muller Forms for the Logic
with Multiple-Valued Inputs

Definition 3.1. A multiple-valued input, two-valued output, completely specified switching function $f$ (multiple-valued function, for short) is a mapping $f(X_1, X_2, \ldots, X_n): P_1 \times P_2 \times \cdots \times P_n \to \{0,1\}$, where $X_i$ is a multiple-valued variable and $P_i = \{0, 1, \ldots, p_i - 1\}$ is the set of truth values that this variable may assume. This is a generalization of an ordinary $n$-input switching function $f: \{0,1\}^n \to \{0,1\}$.

Definition 3.2. For any subset $S_i \subseteq P_i$, $X_i^{S_i}$ is a literal of $X_i$. The set of values $S_i$ will be called the polarity of literal $X_i^{S_i}$. The literal $X_i^{S_i}$, where $S_i \subseteq P_i$, is defined as follows: $X_i^{S_i} = 1$ if $X_i \in S_i$, and $X_i^{S_i} = 0$ otherwise.

Example 3.1. For values 0, 1 or 2 of a 5-valued variable $X$, the literal $X^{0,1,2}$ equals 1. For values 3 or 4 of a 5-valued variable $X$, the value of the literal $X^{0,1,2}$ equals 0.

Definition 3.3. A product of literals, $X_1^{S_1} X_2^{S_2} \cdots X_n^{S_n}$, is referred to as a product term (also called term or product for short). A sum of products is denoted as a (multi-valued input) sum-of-products expression (SOPE).

Example 3.2. 2-bit decoders have pairs of primary inputs of the function as their inputs. Assume pairing of variables $X_1 = (x_i, x_j)$. The corresponding 2-bit decoder has two input variables, $x_i$ and $x_j$, and $2^2 = 4$ outputs: $\bar{x}_i + \bar{x}_j$, $\bar{x}_i + x_j$, $x_i + \bar{x}_j$, $x_i + x_j$. Those outputs correspond to the following literals of variable $X_1$: $X_1^{0,1,2}$, $X_1^{0,1,3}$, $X_1^{0,2,3}$, $X_1^{1,2,3}$, respectively.

Switching functions with multiple-valued inputs and two-valued outputs find several applications in logic design, pattern recognition, and other areas. In logic design, they are primarily used for the minimization of PLAs that have 2-bit decoders on the inputs. A Programmable Logic Array (PLA) with $r$-bit input decoders directly realizes a SOPE of a $2^r$-valued input, two-valued output function [25]. Such decoders can also be used in any other realization of the logic with multiple-valued inputs, like multiple-valued input ESOPs [18,27]. A simplified form of such decoders was used in [29,30,31] in the realization of Multiple-Valued Input Kronecker Reed-Muller Forms (MIKRMs). It will also be used in the "fixed-polarity" Multi-Valued Input Kronecker Reed-Muller Trees (MIKRMTs) that will be introduced here. This simplification consists in creating a simplified decoder with $2^r - 1$ outputs for $r$ input signals. The set of outputs of the simplified decoder is a subset of functions (all but one) realized by a standard decoder. For instance, for a 4-valued input signal $X$ one needs any three outputs of a standard decoder from Example 3.2.

In the case of binary-input logic, each variable from a GRM form can have one of two possible polarities, 0 or 1. The notation used for binary functions is: $x_i^0 = \bar{x}_i$, $x_i^1 = x_i$. Let us observe that if two polarities were available for even a single variable, then the ESOP expression including literals of both polarities would not be canonical; for instance, $x$ and $1 \oplus \bar{x}$ would represent two different expressions for the same function $f(x) = x$.

The question arises how to create canonical generalized Reed-Muller forms for multiple-valued input logic. Methods were shown in [19,29,30,31]. Here we will present another method, which allows for more general interpretations. It is next used to create the family of canonical forms and trees. Let us first observe that in a logic with a $p_i$-valued input $X_i$ there exist $p_i$ different single logic values: for variable $X_i$ one can create $p_i$ different literals with arbitrary single-value polarities. It is obvious that if we take all those literals to the ESOP expression, then there will be more than one way to describe any Boolean function of a single variable. If a single literal from this set of literals is removed, then the remaining literals describe any single-input function in a univocal (canonical) way. For instance, for a $p$-valued variable $X$ one has the literals $X^0, X^1, X^2, \ldots, X^{p-1}$. Removing any one of them, say $X^2$, one gets the following literals: $X^0, X^1, X^3, X^4, \ldots, X^{p-1}$. Such literals will be called allowed literals. Literal $X^2$ is univocally created as $1 \oplus X^0 \oplus X^1 \oplus X^3 \oplus \cdots \oplus X^{p-1}$. It can be proven [29,30,31] that for a GRM expansion one can take any $p-1$ single-valued literals, and moreover, any $p-1$ literals that form an orthogonal polarity matrix. For instance, for $p = 4$ one can have the following set of allowed literals: $\{X^{0,1,2}, X^{0,1,3}, X^{0,2,3}\}$, which is described by a polarity matrix:

$$PM(X) = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} X^{0,1,2} \\ X^{0,1,3} \\ X^{0,2,3} \end{bmatrix} \qquad (8)$$


It is assumed that logic value 1 (universe) is available. This corresponds to a literal with all possible values, which in turn means a row of all ones in the "expanded" polarity matrix.

The orthogonal expanded polarity matrix also includes a row of ones, which corresponds to the universe 1. For the above example the expanded polarity matrix is:

$$EPM(X) = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} X^{0,1,2} \\ X^{0,1,3} \\ X^{0,2,3} \\ X^{0,1,2,3} \end{bmatrix} = \begin{bmatrix} X^{0,1,2} \\ X^{0,1,3} \\ X^{0,2,3} \\ 1 \end{bmatrix} \qquad (9)$$


Let us observe that all possible literals can be created by exoring rows of this matrix. The Expanded Polarity Matrix of variable $X_i$ is also called the polarity of this variable. Let us observe that there are the following expanded polarity matrices of binary variables:

POLARITY-2-1-A: $EPM(X) = \begin{bmatrix} 1 \\ X^0 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$

POLARITY-2-1-B: $EPM(X) = \begin{bmatrix} 1 \\ X^1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$

POLARITY-2-2: $EPM(X) = \begin{bmatrix} X^0 \\ X^1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

Let us observe that when all variables are in polarity POLARITY-2-1-B the function is in the binary Reed-Muller form. When all variables are in polarity POLARITY-2-1-A the function is in the binary Negative Reed-Muller form. When all variables are in polarity POLARITY-2-2 the function is in the binary canonical AND-EXOR minterm form. When each variable is either in polarity POLARITY-2-1-A or in polarity POLARITY-2-1-B, the function is in a GRM form. Applying expansions of variables according to POLARITY-2-1-A or POLARITY-2-2, we obtain a new (mixed polarity) canonical form. Similarly, applying expansions of variables according to POLARITY-2-1-B or POLARITY-2-2, we obtain another new (mixed polarity) canonical form. Applying expansions of variables according to POLARITY-2-1-A, POLARITY-2-1-B or POLARITY-2-2, we obtain the Kronecker Reed-Muller canonical form [4,8]. The concept of the polarity matrix now allows us to generalize the concept of canonical trees and forms to the logic with multiple-valued inputs.

Any form in which all variables are in the same polarity is called a Multiple-Valued Input, Binary Output Restricted GRM (MIRGRM) form. Such a form is canonical since the expansion is unique for each of its variables. It can be shown that for a logic with 3-valued inputs there are 29 various polarities, and 29 MIRGRMs. The number of MIRGRM forms for a logic with $p$-valued inputs can be calculated from the known mathematical results on the number of orthogonal zero-one matrices. Assuming that universe 1 is available (which is reasonable for practical reasons), expansions that use a row of ones in the expanded polarity matrix are more interesting. Under such an assumption, some examples of sets of allowed literals for a 4-valued input variable $Y$ are: $\{Y^{0,1,2}, Y^{0,1,3}, Y^{1,2,3}\}$, $\{Y^{1,3}, Y^{2,3}, Y^{3}\}$, $\{Y^{0,2}, Y^{0,1}, Y^{0,1,2}\}$, $\{Y^{1,3}, Y^{2,3}, Y^{1,2,3}\}$. It can be easily checked that for all those sets a complete set of all literal values can be obtained from the allowed literals by exoring rows of the expanded polarity matrix. There are examples of using switching functions with such literals for practical circuits such as adders [29,30,31].
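The "easily checked" property can be tested mechanically. The sketch below assumes that "orthogonal" means the rows of the expanded polarity matrix (the allowed literals plus the universe row) are linearly independent over GF(2), so that every literal is an EXOR of rows. The helper names `literal_mask`, `gf2_rank` and `is_allowed_set` are introduced here for illustration only.

```python
def literal_mask(values):
    """Encode a polarity set as a bit mask: bit v is set iff value v is in the set."""
    mask = 0
    for v in values:
        mask |= 1 << v
    return mask


def gf2_rank(masks):
    """Rank over GF(2) of rows given as bit masks (XOR-basis elimination)."""
    basis = []
    for m in masks:
        for b in basis:
            # XOR with b only when that clears b's leading bit in m
            m = min(m, m ^ b)
        if m:
            basis.append(m)
    return len(basis)


def is_allowed_set(literal_sets, p):
    """True if the p-1 literals plus the universe row have full rank p over GF(2)."""
    universe = (1 << p) - 1
    rows = [literal_mask(s) for s in literal_sets] + [universe]
    return len(rows) == p and gf2_rank(rows) == p


if __name__ == "__main__":
    examples = [
        [{0, 1, 2}, {0, 1, 3}, {1, 2, 3}],
        [{1, 3}, {2, 3}, {3}],
        [{0, 2}, {0, 1}, {0, 1, 2}],
        [{1, 3}, {2, 3}, {1, 2, 3}],
    ]
    for sets in examples:
        print(sets, is_allowed_set(sets, 4))           # True for each set from the text
    # A set that fails: Y^{0,1} xor Y^{2,3} already equals the universe row.
    print(is_allowed_set([{0, 1}, {2, 3}, {0, 1, 2}], 4))  # False
```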

Reed-Muller forms are extremely easily testable [6,21,23]. We have proved in a forthcoming paper that all the generalized binary (and even multiple-valued) Reed-Muller forms discussed here also have very good testability properties. Among MIRGRMs for 4-valued logic, especially preferable is the form which corresponds to the set of allowed literals $\{Y^{1,3}, Y^{2,3}, Y^{1,2,3}\}$, since the decoder is very simple, a single OR gate: $x_1 = Y^{1,3}$, $x_2 = Y^{2,3}$, $x_1 + x_2 = Y^{1,2,3}$. The test generation for this form is easy (it uses an adaptation of methods from the literature). It minimizes the total layout area compared with other decoders, because of the small area of the OR gate.

Definition 3.4. The set of allowed literals for a $p$-valued variable $Y$ is a set with $p - 1$ elements whose corresponding polarity matrix is orthogonal.

Definition 3.5. An allowed literal is a literal with the set of values corresponding to a row of an orthogonal polarity matrix.

Definition 3.6. Polarity vector $PV = [PM_1, PM_2, \ldots, PM_n]$ is a vector of polarity matrices of input variables.

Definition 3.7. By a Multiple-Valued Input Kronecker Reed-Muller (MIKRM) Expression for a polarity vector one understands an exclusive sum of products in which all (multiple-valued) literals are allowed literals for this polarity vector.

It can be proven [31] that the MIKRM expression is canonical (which means that if for each variable a single polarity is selected, then there exists only one MIKRM expression for this set of variables and their corresponding polarities). It results directly from the fact that for each of its input variables there exists a unique expansion. Therefore we will refer from now on to a MIKRM form, remembering that there are many such forms for a function. Since for ternary variables there are 29 polarities, there are $29^n$ MIKRMs for a function of $n$ ternary variables. If the universe is not a row of an $EPM(X)$ then all terms of the MIKRM need to include literals of variable $X$. This is of course of only theoretical interest, since in the existing technologies the universe (logic constant 1) is available at no cost.

As we can see, the MIKRM form is a generalization of the concepts of GRM and KRM forms. It can be observed that there are no separate generalizations of GRMs and KRMs for the logic with multiple-valued inputs.

It results from the above definitions that the MIKRM class is properly included in the ESOP class. The concepts and definitions introduced above will now be illustrated with an example.

Example 3.3. Assuming 4-valued input variables $X$ and $Y$, the expression:

$$f(X, Y) = 1 \oplus X^{0,1,2} Y^{0,1,2} \oplus X^{0,1,3} Y^{2,3} \oplus X^{1,2,3} Y^{2,3} \oplus X^{0,2,3} Y^{1,2,3}$$

is an ESOP but it is not a MIKRM form, because variable $X$ has four different polarities, while only three polarities are allowed for it. The equivalent MIKRM can be obtained by the replacement of the fourth literal of variable $X$ (here $X^{0,1,2}$) by an EXOR combination of its other literals:

$$f(X, Y) = 1 \oplus (1 \oplus X^{0,1,3} \oplus X^{1,2,3} \oplus X^{0,2,3}) Y^{0,1,2} \oplus X^{0,1,3} Y^{2,3} \oplus X^{1,2,3} Y^{2,3} \oplus X^{0,2,3} Y^{1,2,3}.$$

The rule $X^{0,1,2} = 1 \oplus X^{0,1,3} \oplus X^{1,2,3} \oplus X^{0,2,3}$ can be written as: $1111 \oplus 1101 \oplus 0111 \oplus 1011 = 1110$. By using the "flattening" Boolean rule the expression $f(X, Y)$ can now be converted to the exor of products form:

$$f(X, Y) = 1 \oplus Y^{0,1,2} \oplus X^{0,1,3} Y^{0,1,2} \oplus X^{1,2,3} Y^{0,1,2} \oplus X^{0,2,3} Y^{0,1,2} \oplus X^{0,1,3} Y^{2,3} \oplus X^{1,2,3} Y^{2,3} \oplus X^{0,2,3} Y^{1,2,3}$$

As we can easily verify, this last form is a MIKRM, since all literals are now allowed, and

$$PM(X) = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{bmatrix}, \qquad PM(Y) = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{bmatrix}.$$

4 The Orthogonal Expansion for Multiple-Valued Input
Switching Functions

Let us assume that a Boolean function in the form of an ESOP is represented as a list of terms called ARESOP. In particular, it can be a set of minterms, or a set of disjoint cubes representing both a SOP and an ESOP. Our algorithms, however, can assume an arbitrary ESOP, for any kind of orthogonal expansion.

Let us assume that a vector of expanded polarity matrices is given:

$PV = [PM_1, PM_2, \ldots, PM_n]$.

To perform an expansion of an ESOP ARESOP with respect to an expanded polarity matrix $PM(X_i)$ of variable $X_i$, one has to convert every literal of variable $X_i$ in all cubes from the ARESOP to an EXOR combination of literals that are allowed for this polarity (a universe (a vector of ones) is treated as an allowed literal as well). If a cube CUB of ARESOP has no literal $X_i^S$ and the universe is absent from the expanded polarity matrix, cube CUB should first be represented as $1 \cdot CUB$, and next the universe 1 from it should be converted to the EXOR combination of literals allowed for variable $X_i$. (This is a generalization of the binary rule $a = ab \oplus a\bar{b}$.) It results from the orthogonal properties of the expanded polarity matrix that such a conversion for variable $X_i$ exists and is unique. Next, one level of flattening is executed and the expression is rearranged to the form with all allowed literals factorized. Below we will illustrate the expansion on an example of a function with ternary variables.

Example 4.1.

1. Given is a list ARESOP of disjoint cubes corresponding to the expression:

$X^{0,1} Y^{0} \oplus X^{1} Y^{1,2} \oplus X^{2} Y^{2}$.

2. One has to find the expansion with respect to variable $X$, with allowed literals $1$, $X^{0,1}$, $X^{0,2}$. The result of conversion (substitution) is: $X^{0,1} Y^{0} \oplus (1 \oplus X^{0,2}) Y^{1,2} \oplus (1 \oplus X^{0,1}) Y^{2}$.

3. After flattening: $X^{0,1} Y^{0} \oplus Y^{1,2} \oplus X^{0,2} Y^{1,2} \oplus Y^{2} \oplus X^{0,1} Y^{2}$.

4. After factorizing the allowed literals: $X^{0,1}(Y^{0} \oplus Y^{2}) \oplus X^{0,2}(Y^{1,2}) \oplus 1(Y^{1,2} \oplus Y^{2})$.

In  the  next  stage  similar  expansion  is  done  for  variable  Y.  The  expansion  uses  the 
respective  expanded  polarity  matrix  EPM(Y). 
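The conversion step of Example 4.1 can be sketched in a few lines of Python. The search below finds, by trying subsets of the allowed literals $1$, $X^{0,1}$, $X^{0,2}$, the EXOR combination equal to a given literal of the ternary variable $X$, and then checks that the factorized result of step 4 equals the starting expression of step 1. This brute-force search is only an illustration (it is not one of the algorithms of [29,30,31]), and the names `row`, `express`, `original` and `factored` are assumptions of the sketch.

```python
from itertools import combinations, product

P = 3  # ternary variables


def row(values):
    """Characteristic vector of a literal over the values 0..P-1."""
    return tuple(1 if v in values else 0 for v in range(P))


# Allowed literals for X in Example 4.1 (the universe counts as a literal)
ALLOWED = {"1": row({0, 1, 2}), "X^{0,1}": row({0, 1}), "X^{0,2}": row({0, 2})}


def express(target):
    """Return the names of allowed literals whose EXOR equals target, or None."""
    names = list(ALLOWED)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            acc = (0,) * P
            for name in subset:
                acc = tuple(a ^ b for a, b in zip(acc, ALLOWED[name]))
            if acc == target:
                return subset
    return None


def lit(value, polarity):
    return 1 if value in polarity else 0


def original(x, y):
    """Step 1: X^{0,1} Y^0 xor X^1 Y^{1,2} xor X^2 Y^2."""
    return ((lit(x, {0, 1}) & lit(y, {0}))
            ^ (lit(x, {1}) & lit(y, {1, 2}))
            ^ (lit(x, {2}) & lit(y, {2})))


def factored(x, y):
    """Step 4: X^{0,1}(Y^0 xor Y^2) xor X^{0,2} Y^{1,2} xor 1 (Y^{1,2} xor Y^2)."""
    return ((lit(x, {0, 1}) & (lit(y, {0}) ^ lit(y, {2})))
            ^ (lit(x, {0, 2}) & lit(y, {1, 2}))
            ^ (lit(y, {1, 2}) ^ lit(y, {2})))


if __name__ == "__main__":
    print(express(row({1})))   # ('1', 'X^{0,2}')  i.e. X^1 = 1 xor X^{0,2}
    print(express(row({2})))   # ('1', 'X^{0,1}')  i.e. X^2 = 1 xor X^{0,1}
    assert all(original(x, y) == factored(x, y)
               for x, y in product(range(P), repeat=2))
```

Because the expanded polarity matrix is orthogonal, `express` finds exactly one subset for every literal, which is the uniqueness property used in the text.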

Two efficient computer-oriented algorithms to perform this kind of expansion for flat forms are given in [29,30,31] and illustrated with examples there. They use neither flattening nor factorizing; however, they cannot be applied to create tree expansions.

Below we will introduce the basic concept of the EXOR-type Shannon expansion for the mv logic. As is well known, the Shannon expansion theorem has been generalized by Rudell [24] for the mv logic. His expansion is of AND-OR type, to be used for mv SOP synthesis. On the other hand, three generalizations of the Shannon theorem for Boolean rings are known [4,9] (rules 5, 6 and 7 in section 2). Here we will formulate an expansion that generalizes both Rudell's and Davio's expansions: it is in terms of AND-EXOR expansion, and it is for mv logic.

It can be derived that the orthogonal expansion of function $f$ with respect to multiple-valued input variable $X_i$ of expanded polarity matrix $EPM(X_i)$ can be expressed by the following formula:

$$f = \bigoplus_{X_i^{S_j} \in EPM(X_i)} f_{X_i^{S_j}} \, X_i^{S_j} \qquad (10)$$

where the values of $f_{X_i^{S_j}}$ are calculated as follows: $[f_{X_i^{S_j}}]^T = [f_{X_i,j}]^T [NP]^{-1}$. Here $[f_{X_i^{S_j}}]$ is a vector of single-literal orthogonal expansions of literals $X_i^{S_j}$, $j = 0, \ldots, p-1$; $[f_{X_i,j}]$ is a vector of single-literal standard expansions of single-value literals $X_i^{j}$ ($X_i = j$), $j = 0, \ldots, p-1$; $[A]^T$ means the transpose of matrix $[A]$; $[A]^{-1}$ means the inverse of matrix $[A]$; $[NP]$ is a normalized polarity matrix, which relates polarities of multiple-valued input literals to single-value literals.

Instead  of  proving  this  expansion  for  a general  case  we  will  sketch  the  proof  using 
another  example. 

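Example 4.2. The expanded polarity matrix for ternary variable $X_i$ is:

$$EPM(X_i) = \begin{bmatrix} X_i^{0,2} \\ X_i^{0,1} \\ X_i^{2} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (11)$$

According to formula 10 the orthogonal expansion for $EPM(X_i)$ is:

$$f = f_{X_i^{0,2}} X_i^{0,2} \oplus f_{X_i^{0,1}} X_i^{0,1} \oplus f_{X_i^{2}} X_i^{2}. \qquad (12)$$

We will derive the values of $f_{X_i^{0,2}}$, $f_{X_i^{0,1}}$, and $f_{X_i^{2}}$. It holds for non-overlapping literals [24]: $f = f_{X_i,0} X_i^{0} + f_{X_i,1} X_i^{1} + f_{X_i,2} X_i^{2}$, which, with respect to the disjointness of $X_i^{s}$, $X_i^{r}$, $s \neq r$, gives: $f = f_{X_i,0} X_i^{0} \oplus f_{X_i,1} X_i^{1} \oplus f_{X_i,2} X_i^{2}$. Then:

$$f = f_{X_i,0} X_i^{0} \oplus f_{X_i,1} X_i^{1} \oplus f_{X_i,2} X_i^{2} = f_{X_i^{0,2}} X_i^{0,2} \oplus f_{X_i^{0,1}} X_i^{0,1} \oplus f_{X_i^{2}} X_i^{2} \qquad (13)$$

$$f = (f_{X_i^{0,2}} X_i^{0} \oplus f_{X_i^{0,2}} X_i^{2}) \oplus (f_{X_i^{0,1}} X_i^{0} \oplus f_{X_i^{0,1}} X_i^{1}) \oplus f_{X_i^{2}} X_i^{2} = (f_{X_i^{0,2}} \oplus f_{X_i^{0,1}}) X_i^{0} \oplus f_{X_i^{0,1}} X_i^{1} \oplus (f_{X_i^{0,2}} \oplus f_{X_i^{2}}) X_i^{2} \qquad (14)$$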

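In matrix form, equation 13 becomes:

$$\begin{bmatrix} f_{X_i,0} & f_{X_i,1} & f_{X_i,2} \end{bmatrix} \begin{bmatrix} X_i^{0} \\ X_i^{1} \\ X_i^{2} \end{bmatrix} = \begin{bmatrix} f_{X_i^{0,2}} & f_{X_i^{0,1}} & f_{X_i^{2}} \end{bmatrix} \begin{bmatrix} X_i^{0,2} \\ X_i^{0,1} \\ X_i^{2} \end{bmatrix} \qquad (15)$$

The relation between the disjoint and non-disjoint literals is given by the equation:

$$\begin{bmatrix} X_i^{0,2} \\ X_i^{0,1} \\ X_i^{2} \end{bmatrix} = [NP] \begin{bmatrix} X_i^{0} \\ X_i^{1} \\ X_i^{2} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_i^{0} \\ X_i^{1} \\ X_i^{2} \end{bmatrix} \qquad (16)$$

Substituting 16 into 15 one obtains:

$$\begin{bmatrix} f_{X_i,0} & f_{X_i,1} & f_{X_i,2} \end{bmatrix} \begin{bmatrix} X_i^{0} \\ X_i^{1} \\ X_i^{2} \end{bmatrix} = \begin{bmatrix} f_{X_i^{0,2}} & f_{X_i^{0,1}} & f_{X_i^{2}} \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_i^{0} \\ X_i^{1} \\ X_i^{2} \end{bmatrix} \qquad (17)$$

From there:

$$\begin{bmatrix} f_{X_i^{0,2}} & f_{X_i^{0,1}} & f_{X_i^{2}} \end{bmatrix} = \begin{bmatrix} f_{X_i,0} & f_{X_i,1} & f_{X_i,2} \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} f_{X_i,0} & f_{X_i,1} & f_{X_i,2} \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} f_{X_i,0} \oplus f_{X_i,1} & f_{X_i,1} & f_{X_i,0} \oplus f_{X_i,1} \oplus f_{X_i,2} \end{bmatrix} \qquad (18)$$

Formula 18 gives the values $f_{X_i^{0,2}}$, $f_{X_i^{0,1}}$, and $f_{X_i^{2}}$, which can be substituted into formula 12 in order to calculate the orthogonal expansion (End of Example).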
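The matrix computation of Example 4.2 can be reproduced numerically. The following sketch inverts $[NP]$ over GF(2) by Gauss-Jordan elimination, multiplies the row vector of standard cofactors by the inverse, and checks that the resulting coefficients reproduce the function through formula 12. The function names are chosen for this illustration and do not come from the paper.

```python
from itertools import product


def gf2_inverse(m):
    """Inverse of a square 0/1 matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(m)
    aug = [list(r) + [int(i == j) for j in range(n)] for i, r in enumerate(m)]
    for col in range(n):
        pivot = None
        for r in range(col, n):
            if aug[r][col]:
                pivot = r
                break
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [r[n:] for r in aug]


def vec_mat(v, m):
    """Row vector times matrix over GF(2)."""
    out = []
    for j in range(len(m[0])):
        acc = 0
        for k in range(len(v)):
            acc ^= v[k] & m[k][j]
        out.append(acc)
    return out


def orthogonal_coefficients(cofactors, np_matrix):
    """[f_S] = [f_j] [NP]^{-1}: coefficients of the literals in the rows of NP."""
    return vec_mat(cofactors, gf2_inverse(np_matrix))


def reconstruct(coeffs, np_matrix, x):
    """Value of the EXOR of coeff * literal at X = x (formula 12)."""
    acc = 0
    for s in range(len(coeffs)):
        acc ^= coeffs[s] & np_matrix[s][x]
    return acc


# Rows: X^{0,2}, X^{0,1}, X^{2}; columns: values 0, 1, 2
NP = [[1, 0, 1],
      [1, 1, 0],
      [0, 0, 1]]

if __name__ == "__main__":
    print(gf2_inverse(NP))   # [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
    for f0, f1, f2 in product((0, 1), repeat=3):
        c = orthogonal_coefficients([f0, f1, f2], NP)
        assert c == [f0 ^ f1, f1, f0 ^ f1 ^ f2]          # formula 18
        assert [reconstruct(c, NP, x) for x in range(3)] == [f0, f1, f2]
```

Here each cofactor is treated as a single bit; for a function of more variables the same computation would be applied to every point of the remaining inputs.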

It can be easily checked, by substituting the respective expanded polarity matrices into formula 10, that the expansions 5 - 7 and Rudell's expansion are particular cases of this new expansion. This method can also be easily generalized for incompletely specified functions.


1. The orthogonal expansion applied in some restricted way to a multiple-valued input ESOP creates a family of canonical tree expansions analogous to those for binary logic.

2. Applying the expansion uniformly in a tree for a fixed order of expansion variables of the same polarity one obtains the MIRGRM Trees that are the mv counterparts of binary Single Polarity Reed-Muller Trees.

3. Applying the expansion uniformly in a tree for a fixed order of expansion variables of various polarities one obtains the Multiple-Valued Kronecker Reed-Muller Trees (MIKRM Trees) that are the mv counterparts of binary GRM Trees and Kronecker Reed-Muller Trees.

4. Applying the expansion in a tree for a fixed order of expansion variables, but having various variable polarities in different sub-expressions (sub-trees), one obtains the Multiple-Valued Pseudo-Kronecker Reed-Muller Trees (MIPKRM Trees) that are the mv counterparts of binary Pseudo-Kronecker Reed-Muller Trees.

5. Applying the expansion in a tree for all possible but fixed orders of expansion variables, and having various variable polarities in different sub-expressions (sub-trees), one obtains the Multiple-Valued Quasi-Kronecker Reed-Muller Trees (MIQKRM Trees) that are the mv counterparts of binary Quasi-Kronecker Reed-Muller Trees.

6. Applying the expansion in a tree for all possible orders of expansion variables, having various orders in various sub-trees, and having various variable polarities in different sub-expressions (sub-trees), one obtains a new family of canonical trees.

7. The method can be applied with little modification to multi-output functions: it is applied to each function separately. The logically equivalent sub-trees are combined, which leads to DAG circuits. This transformation preserves the canonicity of the tree circuits.

8. The trees from all the above new families of canonical trees can be flattened to respective canonical mv forms. This leads to MIRGRM forms, MIKRM forms, MIPKRM forms, MIQKRM forms, and new mv canonical forms, respectively.

More detailed characteristics of the above expansions, new mv expansions and computer algorithms to create them will be included in our forthcoming paper.

5 Conclusion 

In this paper several well-known canonical forms have been generalized for the logic with multiple-valued inputs. An Orthogonal Expansion Theorem has also been formulated, which plays as fundamental a role in those expansions as the one played by the Shannon theorem in inclusive logic and by the three Boolean ring expansions for the binary forms. Since the Shannon theorem has several important applications in tautology, complementation, implicant generation and many other areas, and the ring expansions are fundamental to EXOR circuit theories, we expect this theorem to play a fundamental role in the multiple-valued logic as well.

The reader must bear in mind that the expansions proposed here relate to trees and not "flat" forms. For instance, the GRM forms are independent of the order of variables, but the respective GRM trees do depend on this order. Therefore, investigating expansions with a changing order of variables makes practical sense only for some types of expansions. Since several expansions obtained by changing the order of variables produce the same "flat" form, counting such forms can be difficult, as already observed for Quasi-Kronecker forms by Green [4]. It is even more so for our forms, where different orders of variables in subtrees are possible.

Asynchronous Sequential Circuit Design
Using Pass Transistor Iterative Logic Arrays

M.  N.  Liu,  G.  K.  Maki  and  S.  R.  Whitaker 
NASA  Engineering  Research  Center 
for  VLSI  System  Design 
University  of  Idaho 
Moscow,  Idaho  83843 

Abstract  - The  Iterative  Logic  Array  (ILA)  is  introduced  as  a new  architecture 
for  asynchronous  sequential  circuits.  This  is  the  first  ILA  architecture  for 
sequential  circuits  reported  in  the  literature.  The  ILA  architecture  produces 
a very  regular  circuit  structure.  Moreover,  it  is  immune  to  both  1-1  and  0-0 
crossovers  and  is  free  of  hazards.  This  paper  also  presents  a new  critical  race 
free  STT  state  assignment  which  produces  a simple  form  of  design  equations 
that  greatly  simplifies  the  ILA  realizations. 

1 Introduction 

A major  goal  of  modern  Very  Large  Scale  Integrated  (VLSI)  design  is  to  produce  a structure 
that  consists  of  similar,  if  not  identical,  modules.  With  such  a structure  only  one  module 
needs  to  be  designed  and  then  can  be  replicated  to  realize  the  final  circuit.  Very  few 
design  procedures  have  been  advanced  for  sequential  circuits  that  produce  structures  that 
are  easily  realized  for  VLSI  implementations. 

This paper introduces an Iterative Logic Array (ILA) architecture for the realization of asynchronous sequential circuits. With the ILA architecture, an asynchronous sequential circuit can be built in a very regular form with a single type of ILA module as a building block. Furthermore, the ILA asynchronous circuits have some unique features, such as immunity to both 1-1 overlapping and 0-0 crossing, tolerance of function hazards and immunity to input bounce. The fundamental mode operation is still required in the ILA asynchronous sequential circuits.

To further simplify the circuit of the ILA module, a new state assignment has been developed to generate a new form of design equations (the simple term equation). In the simple term design equations, each coefficient is simply a state variable or a constant, instead of the random sum-of-products expression found in a traditional design equation. That simplifies the basic ILA module to a simple pass logic multiplexer.

2 ILA  Architecture 

Iterative Logic Arrays (ILAs) have been described in the literature for quite some time [1,2]. An ILA circuit consists of an array of identical cells. Generally, as shown in Figure 1, each ILA cell contains two sets of input signals. One set of inputs is applied in parallel, while the other set of inputs is driven by adjacent cells. Signals normally propagate in only one direction between cells, and outputs are derived only from the serial outputs of the last cell. This paper presents an ILA architecture for sequential circuits in which the next state of each state variable is generated by a slice of concatenated ILA cells. A sequential network is then constructed by placing the ILA slices side by side. The function of a flow table is implemented by interconnecting ILA cells and the input states.

Figure 1: A slice of ILA circuit

Figure 2: The overall ILA architecture

The basic cell of an ILA sequential network consists of a 2-to-1 multiplexer (MUX) and a next state forming logic. A MUX cell has a select line $S$, its complement $\bar{S}$ and two data inputs $I_0$ and $I_1$, such that $Q = S \cdot I_1 + \bar{S} \cdot I_0$.

The  simplest  way  to  implement  the  MUX  function  is  to  use  a pass  transistor  circuit. 
Basically,  the  pass  transistor  MUX,  excluding  level  restoration  logic,  is  a module  of  two 
pass  transistors,  which  functions  as  two  simple  switches.  Design  considerations,  such  as 
level  restoration,  are  assumed  to  be  handled  by  the  output  buffers.  The  circuit  design 
considerations  have  been  discussed  in  [3,4,5]. 

The overall architecture for the ILA sequential circuit is shown in Figure 2, which implements the next state equation

$$Y_i = f_{i1} I_1 + f_{i2} I_2 + \cdots + f_{in} I_n. \qquad (1)$$

Theorem 1 The architecture depicted in Figure 2 is a proper model for an asynchronous sequential circuit.

Proof: It is assumed that an STT assignment is used and the logic equations are defined in Equation 1. Let the present input be $I_p$, which means that only MUX cell $p$ passes the next logic state $f_{ip}$ to the buffer. Therefore, $Y_i = f_{ip} I_p$ for each $y_i$. When input $I_q$ is present ($I_p = 0$, $I_q = 1$), the only MUX that passes the next logic state is cell $q$. Therefore $Y_i = f_{iq} I_q$ for each $y_i$. Clearly, the architecture realizes Equation 1. □

The  present  state  is  depicted  by  the  state  variables  and  is  fed  back  to  each  ILA  cell. 
The  logic  for  each  state  variable  Y consists  of  n ILA  cells  as  defined  in  Figure  2.  The  set 
of  n cells  is  described  as  a slice  that  realizes  a next  state  variable.  Referring  to  Figure  2, 
all  ILA  cells  that  are  driven  by  the  same  input  state  Ii  belong  to  the  same  level.  With  n 
input  states  and  m state  variables,  there  are  n levels  of  ILA  cells  and  m slices. 
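The behaviour argued in the proof of Theorem 1 can be mimicked with a small behavioural model. The sketch below assumes that a slice is a chain of 2-to-1 MUX cells in which cell $p$ selects $f_{ip}$ when $I_p = 1$ and otherwise passes on the serial output of the preceding cell. Figure 2 is not reproduced here, so this wiring, the initial serial input of 0, and the names `mux` and `ila_slice` are assumptions of the sketch rather than the authors' circuit.

```python
from itertools import product


def mux(select, i1, i0):
    """2-to-1 multiplexer: Q = S * I1 + (not S) * I0."""
    return i1 if select else i0


def ila_slice(coeffs, inputs):
    """Next-state value Y_i produced by one slice of chained MUX cells.

    coeffs -- values f_i1 ... f_in of the next state forming logic (0/1)
    inputs -- input states I_1 ... I_n (0/1); exactly one is 1 in normal use
    """
    q = 0  # assumed serial input of the first cell
    for f_ip, i_p in zip(coeffs, inputs):
        q = mux(i_p, f_ip, q)
    return q


def equation_1(coeffs, inputs):
    """Y_i = f_i1 I_1 + f_i2 I_2 + ... + f_in I_n (inclusive OR of products)."""
    y = 0
    for f_ip, i_p in zip(coeffs, inputs):
        y |= f_ip & i_p
    return y


if __name__ == "__main__":
    n = 3
    one_hot = [tuple(int(k == p) for k in range(n)) for p in range(n)]
    for coeffs in product((0, 1), repeat=n):
        for inputs in one_hot:
            assert ila_slice(coeffs, inputs) == equation_1(coeffs, inputs)
    print("slice model agrees with Equation 1 for all one-hot input states")
```

With one input state active at a time, only the cell driven by that input contributes to the slice output, which is the situation described in the proof.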

3 Design Procedure for Asynchronous ILA Sequential
Network

As  a general  architecture,  the  ILA  architecture  can  be  used  to  realize  the  asynchronous 
design  equations  of  any  STT  assignment.  This  section  compares  the  ILA  circuits  for 
traditional  STT  assignments  and  proposes  a new  state  assignment  which  minimizes  the 
next  state  forming  logic  in  the  ILA  cell. 

3.1  Simple  Term  Design  Equations 

The  set  of  next  state  equations  provides  a mathematical  model  of  a sequential  circuit.  For 
example,  the  design  equations  using  Liu’s  assignment  for  Table  1 are: 

$$Y_1 = y_1 I_1 + (y_1 y_2 + y_2 y_4) I_2 + y_3 I_3$$
$$Y_2 = y_1 I_1 + y_2 I_2 + y_3 I_3$$
$$Y_3 = 0 + y_2 I_2 + y_3 I_3$$
$$Y_4 = 0 + y_2 y_4 I_2 + y_3 I_3$$

If an ILA network is utilized to implement each next state equation, then the next state
forming logic fip in Equation 1 would be a random logic function in sum of products form.
For example, f12 would be y1y2 + y2y4. A circuit to realize the next state forming logic is
simply combinational logic and can be formed from NAND gates, NOR gates or from pass
transistors.

Some research has been conducted to simplify the next state logic. One such effort
was made by Chung-jen Tan [6]. The Tan assignment produces equations in the
so-called simple product term form, in which each coefficient fip is simplified to an OR
expression instead of a random sum of products expression.

Table 1: A flow table and traditional Liu’s assignment.

With traditional state assignments, each fip term in the resulting design equations is a
function of the partitioning variables of input Ip [7]. Since the number of partitioning
variables is equal to the number of k-sets in Ip, the maximum number of inputs to the fip
logic is equal to the number of k-sets. Therefore, a completely regular ILA circuit for a
traditional assignment would contain a k-input next state forming logic for each fip, where
k = “number of k-sets in input Ip”.

A goal of the new design procedure is to minimize the amount of hardware required by
the fip logic. Physically, the minimal form of fip logic is a wire. A design equation in
which all fip terms are in this minimal form is called a simple term equation.

Definition 1 A simple term next state equation is a design equation in which each fip
coefficient is a single state variable (or its complement) or a constant.

In a simple term equation, each coefficient depends on at most one state variable.
This feature has significant impact on the hardware implementation. With a simple term
equation, only a wire is required to generate each fip coefficient. With a procedure discussed
in later sections, the simple term equations for Table 1 can be derived as follows:

$$Y_1 = y_1 I_1 + y_3 I_2 + \bar{y}_2 I_3$$
$$Y_2 = I_1 + y_4 I_2 + y_2 I_3$$
$$Y_3 = I_1 + y_3 I_2 + 0$$
$$Y_4 = \bar{y}_1 I_1 + y_4 I_2 + \bar{y}_2 I_3$$

As  all  coefficients  are  simple  terms,  no  extra  logic  is  required  to  generate  each  fip. 

3.2  New  State  Assignment 

In this research, partition algebra is used to derive simple term equations and synthesize
asynchronous circuits. The new state assignment, called the η assignment, is proposed to
produce the simple term design equations. A relationship between η_i^p partitions and fip
coefficients has been presented in the literature [8]. Theorem 2 in [8] shows part of the
discovery and has been used to analyze state assignments in predicting hardware.

Theorem 2 (i) If η_i^p = τ_j, then fip = yj or ȳj. (ii) If all the states of η_i^p are in one block,
then fip = 0 or 1.

If all fip of the design equations meet the conditions of Theorem 2, then the coefficients
will be simple terms. A sufficient condition on the η-partitions for simple term fip is listed in
Corollary 1.

Corollary 1 If all η_i^p partitions satisfy one of the following conditions:

1. η_i^p = τ_j

2. η_i^p = {S; ∅}

3. η_i^p = {∅; S}

then the design will yield simple terms, where S is the set of all flow table states.

Proof: The proof follows directly from satisfying Theorem 2. □

Most STT assignments do not meet the conditions of Corollary 1. However, an
assignment can be generated such that, for each η-partition η_i^p under input Ip that
does not satisfy the conditions of Corollary 1, a new τ-partition τ_k is created with
τ_k = η_i^p. This allows fip = yk and produces a simple term equation for Yi.

For example, the flow table shown in Table 1 has eight k-sets:

{ABFG}, {AB}, {ACDFH}, {CDEH}, {CE}, {DF}, {BEG}, {GH}.

A set of τ-partitions can be formed to partition each k-set from the rest of the states; the
results are τ1 through τ8 in Table 2 (a). From the corresponding η-partitions, it can
be found that some η-partitions, such as η_1^2, η_3^2, η_4^2 and η_7^2, are not equal to any
τ-partition or a constant. A new τ-partition needs to be formed for each η-partition that does
not meet the conditions of Corollary 1. For each newly created τ-partition, the corresponding
η-partitions need to be formed. The new τ-partitions may generate new η-partitions which
do not meet any condition of Corollary 1. Additional τ-partitions would then be required.
The partitioning procedure continues until the conditions of Corollary 1 are met. In
the case of the assignment for Table 1, τ9, τ10, τ11 and τ12 are formed, which in turn generate
12 corresponding η-partitions. The results are shown in Table 2, where all η-partitions are
equal to a τ-partition or a constant.
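
The iteration described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the flow table is transcribed from the next-state columns listed in Table 4, each partition is represented only by its left block (the right block is the complement), and the helper names `k_sets`, `eta` and `eta_assignment` are hypothetical.

```python
# Next state of each row under inputs I1, I2, I3 (transcribed from Table 4).
FLOW = {
    "A": "ABC", "B": "ABG", "C": "DEC", "D": "DFC",
    "E": "DEG", "F": "AFC", "G": "AHG", "H": "DHC",
}


def k_sets(flow, p):
    """Group the states that lead to the same stable state under input p."""
    groups = {}
    for state, row in flow.items():
        groups.setdefault(row[p], set()).add(state)
    return [frozenset(g) for g in groups.values()]


def eta(flow, tau_left, p):
    """Left block of the eta-partition of tau under input p: the states
    whose next state under p lies in the left block of tau."""
    return frozenset(s for s, row in flow.items() if row[p] in tau_left)


def eta_assignment(flow, n_inputs):
    """Add tau-partitions until every eta-partition equals some tau or
    places all states in a single block (conditions of Corollary 1)."""
    all_states = frozenset(flow)
    taus = []
    for p in range(n_inputs):
        for k in k_sets(flow, p):
            if k not in taus:
                taus.append(k)
    changed = True
    while changed:
        changed = False
        for tau in list(taus):
            for p in range(n_inputs):
                e = eta(flow, tau, p)
                if e and e != all_states and e not in taus:
                    taus.append(e)
                    changed = True
    return taus


taus = eta_assignment(FLOW, 3)
```

For this flow table the loop starts from the eight k-set partitions and stops after adding four more left blocks, which agrees with the twelve τ-partitions of Table 2.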

Simple term equations can be generated once all η-partitions satisfy Corollary 1. In
most cases, however, more τ-partitions (state variables) than necessary have been
introduced and can be removed without jeopardizing the simple term feature.

Definition 2 A τ-partition τ_i is redundant if τ_i and η_i^p for all Ip can be eliminated while the
resulting next state equations remain simple term equations and the state assignment
remains a critical race free STT assignment.

Theorem 3 In the set of τ-partitions resulting from the η assignment, if a τ_i, which
partitions a k-set K_t in Ip from other states, is not equal to any η_j^q or its logical complement,
where i ≠ j, then τ_i is redundant.

(a) τ-partitions

τ1  = {ABFG; CDEH}     τ2  = {AB; CDEFGH}     τ3  = {ACDFH; BEG}
τ4  = {CDEH; ABFG}     τ5  = {CE; ABDFGH}     τ6  = {DF; ABCEGH}
τ7  = {BEG; ACDFH}     τ8  = {GH; ABCDEF}     τ9  = {ABCE; DFGH}
τ10 = {ABDF; CEGH}     τ11 = {CEGH; ABDF}     τ12 = {DFGH; ABCE}

(b) η-partitions

η_1^1  = {ABFG; CDEH}      η_1^2  = {ABDF; CEGH}     η_1^3  = {BEG; ACDFH}
η_2^1  = {ABFG; CDEH}      η_2^2  = {AB; CDEFGH}     η_2^3  = {-; ABCDEFGH}
η_3^1  = {ABCDEFGH; -}     η_3^2  = {DFGH; ABCE}     η_3^3  = {ACDFH; BEG}
η_4^1  = {CDEH; ABFG}      η_4^2  = {CEGH; ABDF}     η_4^3  = {ACDFH; BEG}
η_5^1  = {-; ABCDEFGH}     η_5^2  = {CE; ABDFGH}     η_5^3  = {ACDFH; BEG}
η_6^1  = {CDEH; ABFG}      η_6^2  = {DF; ABCEGH}     η_6^3  = {-; ABCDEFGH}
η_7^1  = {-; ABCDEFGH}     η_7^2  = {ABCE; DFGH}     η_7^3  = {BEG; ACDFH}
η_8^1  = {-; ABCDEFGH}     η_8^2  = {GH; ABCDEF}     η_8^3  = {BEG; ACDFH}
η_9^1  = {ABFG; CDEH}      η_9^2  = {ABCE; DFGH}     η_9^3  = {ACDFH; BEG}
η_10^1 = {ABCDEFGH; -}     η_10^2 = {ABDF; CEGH}     η_10^3 = {-; ABCDEFGH}
η_11^1 = {-; ABCDEFGH}     η_11^2 = {CEGH; ABDF}     η_11^3 = {ABCDEFGH; -}
η_12^1 = {CDEH; ABFG}      η_12^2 = {DFGH; ABCE}     η_12^3 = {BEG; ACDFH}

Table 2: The τ and η partitions

Proof: If the k-set K_t in Ip is the left block of τ_i and τ_i is not equal to any η_j^q or
its logical complement, for all i ≠ j, then for every Iq, q ≠ p, there must exist a k-set K_k
under Ip in the right block of τ_i such that the stable state of K_t and the stable state of K_k
are in the same k-set in Iq. Hence the k-set K_t does not need to be partitioned from k-set
K_k for the input transition Ip to Iq. Moreover, the partitioning variable yi is not needed
to generate any simple term equation Yj. Therefore, τ_i is redundant. □

For example, in Table 2, τ2, τ5, τ6 and τ8 do not appear in the η-partition set except for
η_2^2, η_5^2, η_6^2 and η_8^2. Those τ-partitions are redundant according to Theorem 3. Another way
of looking at this is that y2 appears only in the equation for Y2 and in no other; the same
holds for y5, y6 and y8. Therefore, they are redundant.
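
The appearance test used in this example can be sketched as follows. The sketch is an illustration only: it checks whether a τ-partition or its complement occurs as an η-partition of any other variable, and it deliberately ignores the k-set clause of Theorem 3 and the output compatibility check discussed later. The flow table is transcribed from Table 4, partitions are given by their left blocks, and `redundant` is a hypothetical helper name.

```python
FLOW = {
    "A": "ABC", "B": "ABG", "C": "DEC", "D": "DFC",
    "E": "DEG", "F": "AFC", "G": "AHG", "H": "DHC",
}
ALL = frozenset(FLOW)

# Left blocks of tau_1 .. tau_12 as listed in Table 2 (a).
TAU = {1: "ABFG", 2: "AB", 3: "ACDFH", 4: "CDEH", 5: "CE", 6: "DF",
       7: "BEG", 8: "GH", 9: "ABCE", 10: "ABDF", 11: "CEGH", 12: "DFGH"}


def eta(left, p):
    """Left block of the eta-partition of a tau (given by its left block)."""
    return frozenset(s for s, row in FLOW.items() if row[p] in left)


def redundant(tau):
    """Indices i such that neither tau_i nor its complement occurs as an
    eta-partition of a different variable j under any input."""
    result = []
    for i, left_i in tau.items():
        block = frozenset(left_i)
        used = any(eta(frozenset(left_j), p) in (block, ALL - block)
                   for j, left_j in tau.items() if j != i
                   for p in range(3))
        if not used:
            result.append(i)
    return result


first_pass = redundant(TAU)
```

On the twelve partitions of Table 2 the first pass returns the indices 2, 5, 6 and 8.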

Theorem 4 If τ_i is a logical complement of τ_j, then yi can be replaced by ȳj in all design
equations.

Proof: If τ_i is the logical complement of τ_j, then τ_i and τ_j partition the same k-sets.
Only one of them is needed to meet the partitioning condition for a critical race free STT
assignment. Moreover, according to Rule 2 in Theorem 7, all coefficients where fkp = yi
can be replaced by fkp = ȳj. □

(a) τ-partitions

τ1  = {ABFG; CDEH}     τ3  = {ACDFH; BEG}
τ10 = {ABDF; CEGH}     τ12 = {DFGH; ABCE}

(b) η-partitions

η_1^1  = τ1     η_1^2  = τ10     η_1^3  = τ7
η_3^1  = 1      η_3^2  = τ12     η_3^3  = τ3
η_10^1 = 1      η_10^2 = τ10     η_10^3 = 0
η_12^1 = τ4     η_12^2 = τ12     η_12^3 = τ7

Table 3: The reduced τ and η partitions

Theorem 4 provides the condition for removing another type of redundancy. For
example, τ4 in Table 2 is the logical complement of τ1. Therefore, equation Y4 is a redundant
equation and all coefficients y4 in the simple term equations can be replaced by ȳ1.
Similarly, equation Y7 is also a redundant equation and all coefficients y7 in the simple term
equations can be replaced by ȳ3.

Theorem 4 does not specify which τ-partition should be removed if τ_i is a logical
complement of τ_j. The final result may be quite different depending on whether τ_i or τ_j is
removed, because removing such a τ-partition may make more τ-partitions redundant under
the condition of Theorem 3. In order to obtain a better result, the redundancy specified by
Theorem 3 should be removed first. That allows the designer to make a better choice by
checking fewer τ- and η-partitions. For example, τ9 and τ11 become redundant partitions after
removing τ4, τ7 and the corresponding η-partitions η_4^p, η_7^p, p = 1, 2, 3.

By eliminating redundant τ-partitions and η-partitions, the number of partitions is
significantly reduced. In the example above, 8 out of 12 τ-partitions are removed. Table 3
(a) and (b) show the results.

By using Theorem 3, two stable states under the same input state may have an identical
state assignment if they have the same next states under all inputs. Assigning identical
codes to two stable states is equivalent to merging the two corresponding rows in the flow
table. However, if the outputs associated with these two stable states are not compatible,
such a merging is invalid. A unique state has to be assigned to each stable state so that
the two outputs can be distinguished.

A partitioning chart is introduced to facilitate state assignment reduction and avoid
such invalid merging. The chart has an intersection for each pair of k-sets in an input.
Each τ-partition τ_j introduced by the state assignment is placed in the intersections
where a pair of k-sets is partitioned by τ_j. When a τ_j is removed by the state
assignment reduction procedure in accordance with Theorem 3, τ_j must be removed from
all intersections in the partitioning chart.

If removing τ_j leaves a blank intersection in the partitioning chart, then the
compatibility of the two corresponding rows in the flow table has to be checked. If the
two outputs are not compatible, τ_j should not be removed even though τ_j is redundant
according to Theorem 3. The partitioning chart for the final η assignment is shown in
Figure 3. There is at least one τ-partition at each intersection.

Figure 3: The partitioning chart for the final τ partitions

The following procedure formalizes the η assignment.

Procedure 1 η assignment generation:

Step 1. Select an initial set of τ-partitions τ_i that partition each k-set K_i from the rest of
the states. Create a partitioning chart for the k-sets.

Step 2. Generate all η-partitions η_i^p for each input Ip.

Step 3. For any η_i^p that does not meet the conditions of Corollary 1, generate a new
τ-partition τ_k such that τ_k = η_i^p. Add τ_k to the partitioning chart. Return to Step 2.
The state assignment process is complete when every η_i^p satisfies the conditions of
Corollary 1.

Step 4. Remove each τ-partition τ_i that meets the conditions of Theorem 3, provided that
removing τ_i causes no merging problem in the partitioning chart. Also, for each τ_i removed,
remove η_i^p under all Ip. Repeat Step 4 until all such τ_i have been eliminated.

Step 5. For each pair of τ-partitions τ_i and τ_j that are logical complements, remove τ_j
and the η-partitions η_j^p under all Ip. Once such a τ_j is removed, return to Step 4.

Since Procedure 1 involves iterations of Step 2 and Step 3, it is useful to know whether the
number of τ-partitions is finite. The following theorem shows closure of Procedure 1.

Theorem 5 The number of τ-partitions needed to generate any arbitrary η assignment is
finite.

      I1   I2   I3     y1   y3   y10   y12

A     A    B    C      1    1    1     0
B     A    B    G      1    0    1     0
C     D    E    C      0    1    0     0
D     D    F    C      0    1    1     1
E     D    E    G      0    0    0     0
F     A    F    C      1    1    1     1
G     A    H    G      1    0    0     1
H     D    H    C      0    1    0     1

Table 4: The result of η assignment

Proof: If there are k k-sets in a column of Ip, the maximum number of τ-partitions that
could be generated is

$$\binom{k}{0} + \binom{k}{1} + \binom{k}{2} + \cdots + \binom{k}{k} = 2^k.$$

This is finite. □
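
The count in the proof can be checked by direct enumeration. The sketch below is only an illustration: it lists every left block that can be formed as a union of i of the k k-sets of one column, for i = 0 through k, using the four k-sets under I2 from the example. The function name `candidate_left_blocks` is an assumption.

```python
from itertools import combinations


def candidate_left_blocks(ksets):
    """All unions of i of the given k-sets, for i = 0 .. k."""
    blocks = set()
    for i in range(len(ksets) + 1):
        for chosen in combinations(ksets, i):
            blocks.add(frozenset().union(*chosen))
    return blocks


KSETS_I2 = ["AB", "CE", "DF", "GH"]   # k-sets under I2 in the example
blocks = candidate_left_blocks(KSETS_I2)
```

Because the k-sets of a column are disjoint, each choice gives a distinct block, so `len(blocks)` equals 2 to the power k (16 for k = 4).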

Corollary 2 A simple term η assignment exists.

Proof: A flow table must have at least one input state with more than one k-set, otherwise
the circuit is purely combinational. With k k-sets (k ≥ 2), Theorem 5 shows that the
maximum number of τ-partitions is 2^k. Therefore, an assignment exists because a set of
τ-partitions can be constructed. □

Theorem 6 The η assignment is a valid STT assignment.

Proof: A Liu assignment will produce τ-partitions which partition the k-sets. The
τ-partitions in this assignment consist of two sets of partitions. The first consists of the
initial set of τ-partitions that partition individual k-sets. The second set consists of the
τ-partitions that are created from η-partitions that do not meet the conditions of Corollary 1.
Hence, both types of τ-partitions are elements of a valid Liu assignment and therefore the
overall assignment meets the conditions of an STT assignment. □

The result of the η assignment for the flow table in Table 1 is given in Table 4.

3.3  Design  Equation  Generation 

With the η assignment, the next design step is to associate a unique state variable yi with
each τ-partition τ_i and determine each fip term.

Theorem 7 The next state equations can be derived from the η-partitions with the
following rules:

1. If η_i^p = τ_j, then fip = yj.

2. If the states in the left block of η_i^p are the same as the states in the right block of τ_j
and vice versa, then fip = ȳj.

3. If all the states of η_i^p are in the left block, then fip = 1.

4. If all the states of η_i^p are in the right block, then fip = 0.

Proof: The proof follows directly from Theorem 2 since the η assignment assigns state
variable yj = 1 to states in the left block and yj = 0 to states in the right block of
τ-partition τ_j. □

The following example gives the next state equations by applying Theorem 7 to the
result of the η assignment in Table 3.

Example 1 Applying the revised conditions of Theorem 7 to each η-partition η_i^p in Table 3,
the next state equations can be derived as follows:

$$Y_1 = y_1 I_1 + y_{10} I_2 + \bar{y}_3 I_3$$
$$Y_3 = I_1 + y_{12} I_2 + y_3 I_3$$
$$Y_{10} = I_1 + y_{10} I_2 + 0$$
$$Y_{12} = \bar{y}_1 I_1 + y_{12} I_2 + \bar{y}_3 I_3$$
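
The four rules of Theorem 7 are mechanical enough to sketch in code. The following is an illustrative sketch rather than the authors' tool: the flow table is transcribed from Table 4, the four retained τ-partitions are given by their left blocks as in Table 3 (a), `~y` denotes a complemented variable, and the function names are hypothetical.

```python
FLOW = {
    "A": "ABC", "B": "ABG", "C": "DEC", "D": "DFC",
    "E": "DEG", "F": "AFC", "G": "AHG", "H": "DHC",
}
ALL = frozenset(FLOW)
TAU = {1: "ABFG", 3: "ACDFH", 10: "ABDF", 12: "DFGH"}  # left blocks


def coefficient(i, p):
    """Return f_ip as text, following the four rules of Theorem 7."""
    eta_left = frozenset(s for s, row in FLOW.items() if row[p] in TAU[i])
    if eta_left == ALL:
        return "1"                      # rule 3
    if not eta_left:
        return "0"                      # rule 4
    for j, left in TAU.items():
        if eta_left == frozenset(left):
            return f"y{j}"              # rule 1
        if eta_left == ALL - frozenset(left):
            return f"~y{j}"             # rule 2
    raise ValueError("coefficient is not a simple term")


def equation(i):
    """Next state equation of Y_i as a sum of f_ip*I_p terms."""
    return " + ".join(f"{coefficient(i, p)}*I{p + 1}" for p in range(3))


equations = {i: equation(i) for i in TAU}
```

Printing `equations` lists one line per state variable in the same order as Example 1, with constant coefficients written out explicitly as `1*I1` or `0*I3`.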

The η assignment is guaranteed to produce simple term equations in which all
coefficients are single variables or constants. The general form for the next state equation
is

$$Y_i = \sum_{p=1}^{n} f_{ip} I_p$$

It has been shown in [10] that this equation can be represented as a pass logic expression.
Since only one Ip = 1 at a time and the case when all Ip = 0 is an undefined situation in
an asynchronous flow table, an ILA network will let yi be passed to Yi if the term Īn(yi) is
added. The equation of Yi then becomes

$$Y_i = I_1(f_{i1}) + \bar{I}_1(\cdots I_p(f_{ip}) + \bar{I}_p(\cdots (I_n(f_{in}) + \bar{I}_n(y_i)) \cdots)) \qquad (2)$$

The advantage of this equation is that it allows Yi to maintain the same state when all
Ip = 0, which can happen during an input transition. For example, the simple term equation
for Y1 in Example 1 is

$$Y_1 = y_1 I_1 + y_{10} I_2 + \bar{y}_3 I_3$$

Putting this into the form of Equation 2, the expression can then be converted into

$$Y_1 = I_1(y_1) + \bar{I}_1(I_2(y_{10}) + \bar{I}_2(I_3(\bar{y}_3) + \bar{I}_3(y_1)))$$

Similarly, all other equations can be converted to pass logic expressions in the same
way. The results are as follows:

$$Y_1 = I_1(y_1) + \bar{I}_1(I_2(y_{10}) + \bar{I}_2(I_3(\bar{y}_3) + \bar{I}_3(y_1)))$$
$$Y_3 = I_1(1) + \bar{I}_1(I_2(y_{12}) + \bar{I}_2(I_3(y_3) + \bar{I}_3(y_3)))$$
$$Y_{10} = I_1(1) + \bar{I}_1(I_2(y_{10}) + \bar{I}_2(I_3(0) + \bar{I}_3(y_{10})))$$
$$Y_{12} = I_1(\bar{y}_1) + \bar{I}_1(I_2(y_{12}) + \bar{I}_2(I_3(\bar{y}_3) + \bar{I}_3(y_{12})))$$

Obviously, the next state logic fip is minimized to a wire if the next state equations
are all simple term equations. It is then straightforward to map the design equations to
an ILA network. Figure 4 shows an ILA realization of state variable Y1.

Figure 4: ILA realization of state variable Y1
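
The nested form of Equation 2 behaves like a priority chain, which the following sketch makes explicit. It is a behavioral illustration under stated assumptions, not a circuit model: `pass_logic` and `sum_of_products` are hypothetical names, signals are 0/1 integers, and delays are ignored.

```python
def pass_logic(inputs, f_values, y_present):
    """Evaluate Equation 2: the first asserted input (I_1 first) passes its
    f_ip; if no input is asserted, the present value y_i is passed."""
    for i_p, f_ip in zip(inputs, f_values):
        if i_p:
            return f_ip
    return y_present


def sum_of_products(inputs, f_values):
    """Evaluate the general form: the sum over p of f_ip * I_p."""
    return int(any(i and f for i, f in zip(inputs, f_values)))
```

For one-hot inputs the two functions agree, so the added hold term only changes the behavior when all inputs are 0 (the present value is kept) or when inputs overlap (the input nearest the output wins).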

4 Input  and  Hazard  Characteristics 

In addition to the regularity of the ILA network with the η assignment, an ILA realization
has other features such as (1) immunity to input 1-1 overlapping and 0-0 crossing, (2)
immunity to input bounce while 1-1 overlapping is not present, and (3) freedom from
transition path hazards and input state transition hazards.

Potential  conflicts  arise  in  asynchronous  sequential  networks  when  more  than  one  input 
state  is  present  at  a time  (1-1  overlapping),  or  when  none  of  the  input  states  are  active 
(0-0  crossing).  The  overlapping  and  crossing  situations  occur  because  two  input  states 
rarely  switch  at  exactly  the  same  time  due  to  differences  in  delay  through  forming  logic. 
Most design procedures avoid such uncertainties by setting a constraint either to forbid
1-1 overlapping or to forbid 0-0 crossing of input states. Another form of hazard on an
input signal which may cause the circuit to malfunction is the dynamic hazard that occurs
during an input transition.

Theorem 8 The ILA architecture tolerates 1-1 input overlapping.

Proof: In Equation 2, assume the input state Ip has higher priority relative to input
state Iq. In other words, Ip is the control variable of an ILA cell which is closer to the
output than the ILA cell with the control variable Iq. If both Ip and Iq are asserted 1,
then Yi will assume the value specified by fip rather than fiq.

In the case where the input switches from Ip to Iq, when Ip is 1, it passes fip to Yi and
meanwhile cuts off the path of fiq to Yi, no matter whether Iq is set to 1 or 0. With such an
architecture, the 1-1 overlapping of inputs Ip and Iq has the same effect on the output Yi
as Ip = 1, Iq = 0.

In the case where the input switches from Iq to Ip, when Iq = 1, all Yi are determined
by the values which are propagated through the ILA. Once Ip = 1, the fip values are
passed through the ILA independent of whether Iq is 0 or 1. Hence the circuit assumes
the proper next state value. □

For example, in Figure 4, if I1 and I2 are active, the circuit for all next state variables
will be under the control of I1 only, and Y1 will assume the value of y1. y10 can be passed
to Y1 only if I1 = 0 while I2 is active. Once I1 is set to 1 again, the value of y10 at Y1 will
become y1 immediately (assuming I2 is set to 0 simultaneously).

The 0-0 crossing happens when all input states are 0. Let the input change from Ip to
Iq. If input Ip goes to 0 before Iq goes to 1, there will be a period in which all input lines
are 0. In a traditional design, the outputs of the design equations could assume an
indeterminate state when all inputs are 0. The ILA architecture solves the problem by
providing a path for each state variable to pass yi to Yi. When all inputs are 0, it allows Yi
to maintain its current value yi.

Again, Figure 4 can be used to show the feature of 0-0 crossing tolerance. For example,
assuming the input transition is from I2 to I3, when I2 = 1, y10 is passed to Y1, and y1
is fed back into the last ILA cell under I3. Once the network is stable, the level of y10 has
been passed to y1. When I2 is set to 0, the path provided by Ī1, Ī2 and Ī3 will still maintain
the level of Y1 at the current value of y1. The output of the network remains unchanged
during the period when all inputs are 0 until I3 is set to 1. Then a new level, determined
by ȳ3, will be passed to Y1 and the network assumes a new state.
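
The overlap and crossing behavior described above can be traced with a small simulation of the four pass logic expressions. This is an idealized sketch, not a timing model: all state variables are re-evaluated together with equal delay, the state codes (y1, y3, y10, y12) are copied from Table 4, and the names `step` and `settle` are assumptions.

```python
# Codes (y1, y3, y10, y12) of three states, copied from Table 4.
A, E, G = (1, 1, 1, 0), (0, 0, 0, 0), (1, 0, 0, 1)


def step(state, inputs):
    """One evaluation of the four pass logic expressions with feedback."""
    y1, y3, y10, y12 = state
    i1, i2, i3 = inputs

    def chain(f1, f2, f3, hold):
        # I1 controls the cell nearest the output, so it has priority.
        if i1:
            return f1
        if i2:
            return f2
        if i3:
            return f3
        return hold  # all inputs 0: the present value is fed back

    return (chain(y1, y10, 1 - y3, y1),
            chain(1, y12, y3, y3),
            chain(1, y10, 0, y10),
            chain(1 - y1, y12, 1 - y3, y12))


def settle(state, inputs, limit=16):
    """Re-evaluate until the state stops changing."""
    for _ in range(limit):
        nxt = step(state, inputs)
        if nxt == state:
            return state
        state = nxt
    raise RuntimeError("state did not settle")
```

Starting in state E under I2, `settle(E, (0, 0, 0))` keeps E during a 0-0 crossing and `settle(E, (0, 0, 1))` reaches G once I3 arrives. From G, the overlapped input `(1, 0, 1)` gives the same result as `(1, 0, 0)`, namely A.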

An input bounce can be considered a dynamic hazard during the transition of input
states. With the ability to tolerate input bounce, the ILA network allows extra input
transitions to occur before the circuit is stabilized. However, the input bounce can be
tolerated only if it is not necessary to also tolerate 1-1 overlapping. If an input Iq bounce
occurs when two input states Ip and Iq are overlapping, an ambiguity is created regarding
the interpretation of the transition; it could be one transition from Ip to Iq, or three
transitions from Ip to Iq, then to Ip and back to Iq. In order to avoid the ambiguity, it is
assumed that during the input state transition from Ip to Iq the circuit tolerates either the
bounce condition or the 1-1 overlapping, but not both.

Theorem 9 For the simple term equations derived from an η assignment, all partitioning
variables under input Ip do not change as the circuit transitions from one state to another
under Ip.

Proof: The η assignment produces simple term equations of the form

$$Y_i = \cdots + y_j I_p + \cdots$$

where yj is the partitioning variable of Ip. yj is the only partitioning variable for all
transition paths in the η assignment. According to Theorem 2 in [7], yj will not change in
any transition under Ip. Hence, the partitioning variables of the simple term equations do
not change during the input transitions. □

Figure 5: A traditional realization of a partial flow table (design equations for Yi and Yj
and a gate-level schematic for Yi)

From Theorem 9, the partitioning variables do not change when an input signal Ip
undergoes a bounce. In the case that a dynamic hazard is present when Ip transitions from
0 to 1, the circuit begins to transition to the new state when Ip goes to 1. Meanwhile, the
partitioning variables of Ip remain stable. Since the partitioning variables determine all
fip values [7] and since the partitioning variables are unaffected by the input bounce, the
circuit will assume the proper next state value when Ip is stabilized.

In the case that a dynamic hazard is present when Ip transitions from 1 to 0, the present
state is simply passed to the next state once Ip goes to 0, as this represents a 0-0 crossover
condition. The circuit does not transition. When Ip returns to 1 on the bounce condition,
again from Theorem 9, there will be no transition in the partitioning next state variables.
Since the circuit is a function only of the partitioning variables [7], the output of the circuit
remains unchanged.

A critical race free asynchronous sequential network may still malfunction due to
unwanted switching transients in the combinational circuit. A transition path hazard is one
that is present within the states of a transition path. The simple term equations for the
next state variables have the form

$$Y_i = \cdots + y_j I_p + \cdots$$

where yj is the partitioning variable of Ip. From Theorem 9, partitioning variable yj does
not change as the circuit transitions from one state to another in Ip. In general, since the
partitioning variables of the simple term equations do not change during the transition,
it is impossible for a hazard of any kind to occur. Therefore, no static, dynamic or function
hazards can occur during the transition between unstable and stable states.

A circuit free of static and dynamic hazards may have a hazard problem caused by the
change of more than one variable in the design equation. One type of such multi-variable
change hazard is a function hazard. The problem can be illustrated by the partial flow
table in Figure 5 where design equations for Yi, Yj and a schematic for Yi are shown. A
hazard exists when an input changes from I2 to I1. If the delay of AND gate-1 is longer
than the total delays of AND gate-2, the OR gate and the buffer, then the output of gate-1
will remain 0 after the input changes to I1 = 1 and I2 = 0. That will in turn cause yi
to remain at 0 and lock the output of gate-1 to 0. The result is that state variables yiyj
are set to 00 instead of the intended value 10. Moreover, the circuit is locked up in the wrong
state until a reset line is provided.

The problem occurs in a traditional design where a slow gate has the same effect as
a 0-0 crossover: input Ip = 0 sets the product term to 0. The ILA circuit solves the
problem by maintaining the same state when all inputs go to 0. Also, the 1-1 overlapping
problem which may arise in the traditional design due to a slow gate can be tolerated
by the ILA circuit. The 1-1 overlapping and 0-0 crossover properties of the ILA architecture
prevent the circuit from malfunctioning due to such input state transition hazards.

Pulse Mode VLSI Asynchronous Circuits

Q. Chen and G. Maki
NASA Space Engineering Research Center
for VLSI System Design
College of Engineering
University of Idaho
Moscow, Idaho 83843

Abstract - A new basic VLSI circuit element is presented that can be used
to realize pulse mode asynchronous sequential circuits. A synthesis procedure
is developed along with an unconventional state assignment procedure. Level
input asynchronous sequential circuits can be realized by converting a regular
flow table into a differential mode flow table, thereby allowing the new
synthesis technique to be general. The new circuits tolerate 1-1 crossovers. This
circuit also provides a means for state sequence detection and real time fault
detection.

1 Introduction 

Many asynchronous sequential circuits can be modeled as a pulse mode circuit since the
inputs are presented in the form of pulses [1]. Level input sequential circuits can be
modeled as a pulse mode circuit by detecting input state changes [2]. This work presents
a basic circuit that can be used to realize state variables that are effective in the realization
of pulse mode circuits.

Sequential circuits are normally defined in terms of flow tables, such as shown in Table
1. The inputs are shown across the top and the states along the side. The states are
encoded with internal state variables yi. Next state variables Yi identify the next state
that the circuit will assume.

This  paper  presents  a VLSI  circuit  element  that  allows  for  efficient  realizations  of  pulse 
mode  asynchronous  sequential  circuits.  The  network  consists  of  pass  transistor  next  state 
forming  logic  with  a unique  buffer. 

The  paper  describes  the  following: 

• Synthesis  procedures  for  pulse  mode  asynchronous  sequential  circuits. 

• State  assignment  procedure  for  differential  mode  asynchronous  sequential  circuits. 

• Tolerance  of  1-1  input  crossover  situations.  (This  circuit  is  designed  to  tolerate  0-0 
input  crossover  situations  also.) 

• State  sequence  detection. 


• Real time fault detection.

2 Pulse Mode Circuits

The next state equations can be expressed as follows [3]:

$$Y_i = f_{i1} I_1 + f_{i2} I_2 + \cdots + f_{in} I_n \qquad (1)$$

where Yi is the next state variable, Ip is the input state and fip is a sum-of-products
expression of state variables. It has been shown that the next state equations can be
expressed as a pass logic expression [3]:

$$Y_i = I_1(f_{i1}) + I_2(f_{i2}) + \cdots + I_n(f_{in}) \qquad (2)$$

where Ip(fip) means input Ip passes function fip.

The basic circuit to implement pulse mode circuits is shown in Fig. 1. Each state
variable is realized with this circuit. If there are m state variables, then there would be m
such circuits except that there is only one NOR gate.

Figure  1:  Next  State  Circuit  Module 


In pulse mode operation, all Ip could be 0. When all input states Ip are 0, the pass
networks fip are disabled and hence are tristated from the inverter input of the first stage.
The feedback inverter in the first stage is provided to sustain the value at point A of the
first stage. However, the feedback inverter consists of weak devices that can be overdriven
by the Ip(fip) networks. The same kind of inverter is placed in the second stage of the
circuit after transistor T.

For pulse mode operation, assume one and only one input state Ip is 1 at a time or
all Ip are 0. In other words, only one input pulse is present at a time. When Ip = 1,
pass network fip presents the proper next state value for Yi, as specified in Eq. 2, to
the input of the inverter at point A. The feedback inverter is composed of weak pullup
and pulldown transistors such that they can be overdriven by the value passed by Ip(fip).
Therefore the correct next state value as defined by Eq. 2 is present at point A in Fig. 1
and the output of the first-stage inverter holds its complement. When all Ip = 0, transistor T
is enabled, the complement is passed to the second stage, and the circuit assumes the proper
next state.

To summarize, when one Ip = 1, the first-stage inverter output assumes the complement of
the proper next state value of Yi as defined by Eq. 2. When all Ip = 0, this complement is
passed to the second-stage inverter and yi assumes the value defined by Eq. 2. The new present
state feeds back to the fip networks to generate the new next state values to the first stage,
dependent on which Ip = 1. An interesting observation can be made which is common to all
asynchronous sequential circuits, but perhaps is more easily seen here. When all Ip = 0, the
present state, as determined by present state variables yi, feeds back to the fip logic. All
possible next states are generated and appear at the input of the pass transistors controlled
by Ip. The circuit has “calculated”, as determined by Eq. 2, all possible next states that the
circuit could enter and is prepared to assume any and every next state as defined by the
fip terms. The exact next state is specified by the Ip state that becomes 1.

The state assignment problem for asynchronous sequential circuits is always a significant
problem. Pulse mode flow tables are in every way asynchronous in operation.
Therefore, the designer must be concerned about state assignment issues. Assume the
present state of the circuit is Si and state Sj is the next state when input Ij becomes 1.
When all inputs are 0 prior to Ij = 1, the state variables yi define the circuit to be in state
Si. When Ij = 1, since transistor T is disabled, the present state variables yi do not change.
Yi changes to assume the values associated with Sj as defined by Eq. 2 when Ij = 1. However,
yi remains unchanged as long as Ij = 1; yi does not change to the value of Sj until Ij
returns to 0, at which time Yi cannot change. Therefore, each state transition occurs in
two stages:


$$Y_i = \sum_p I_p(f_{ip}) \quad \text{when } I_p = 1$$
$$y_i = Y_i \quad \text{when all } I_p = 0$$

A critical race can exist in an asynchronous sequential circuit only when the state
variables yi being fed back can affect Yi without a change in input. Since the inputs must
change before present state variable yi can affect next state variable Yi, no critical race
can occur. The following theorem has been established.

Theorem  1 Asynchronous  sequential  circuits  implemented  with  the  basic  circuit  shown  in 
Fig.  1 are  void  of  critical  races. 

If  the  circuit  cannot  experience  a critical  race,  then  the  Single  Transition  Time  (STT) 
state  assignment  procedures  need  not  be  followed,  specifically  the  Tracey  conditions[5]  need 
not  be  met.  Moreover,  since  the  STT  conditions  need  not  be  met,  any  state  assignment  is 
satisfactory  as  long  as  each  state  has  a unique  code. 

The  design  procedure  can  be  stated  as  follows: 

Procedure 1 Step 1 Create an appropriate flow table.

Step 2 Provide a state assignment where each state has a unique code.

Step 3 Form the state table.

Step 4 Find the next state equations in the following form:

$$Y_i = \sum_p I_p(f_{ip})$$

where each input passes an fip expression of state variables.

Table 1: Example Flow Table

Example 1 Realize a circuit which has two pulse inputs X and C and a level output Z.
C represents a clock that produces pulses at a regular interval. Z must be 1 between pulses
Ci and Ci+1 only if an X pulse occurred between clock pulses Ci-1 and Ci.

The  reduced  flow  table  with  the  state  assignment  is  shown  in  Table  1.  The  design 
equations  for  this  flow  table  are: 

U=X(yi)+  C(yT(Vi)  + Vitift)) 

Y2  = X(yi)+  C(yi(y2)  + J/i(2/2  )) 

2.1  Design  By  Inspection 

The synchronous state assignment procedure allows for a great deal of flexibility. The one-
hot-code is well known as a state assignment that allows one to derive the design equations
by inspection. A one-hot-code encodes an n-row flow table with n state variables where
state Si is encoded with yi = 1 and all other yj = 0, j ≠ i. A predecessor state of state Si
is a state the circuit is in prior to an input change that forces the circuit into Si.

If Sj is a predecessor state to Si under input Ip, the partial next state equation is

$$Y_i = I_p(y_j).$$

If Sj1, Sj2, ..., Sjk are predecessor states to Si under Ip, then the partial next state equation is

$$Y_i = I_p(y_{j1} + y_{j2} + \cdots + y_{jk}).$$

In general, the fip terms become simple sum-of-products where each product is an
uncomplemented state variable.

Design Procedure 1 can be employed by simply changing Step 2 to implementing a one-
hot-code. The equations can be formed by the well known inspection method. Simpler
fip terms result. The disadvantage is that more state variables are generally needed. The
design equations for Table 1 are

$$Y_1 = C(y_1 + y_3)$$

$$Y_2 = X(y_1)$$

$$Y_3 = C(y_2 + y_4)$$

$$Y_4 = X(y_3)$$
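The two-stage behaviour of the one-hot equations Y1 = C(y1 + y3), Y2 = X(y1), Y3 = C(y2 + y4), Y4 = X(y3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the start state S1 = 1000 is chosen arbitrarily, and an input that does not appear in a next state equation is assumed to pass a 0 to that variable.

```python
# Behavioural sketch of the one-hot design equations (not a circuit model).
# Assumption: an input absent from an equation passes 0 to that variable.

def next_state_values(y, pulse):
    """Stage 1: values presented at point A while `pulse` is 1."""
    y1, y2, y3, y4 = y
    if pulse == "C":
        return (y1 | y3, 0, y2 | y4, 0)   # Y1 = C(y1 + y3), Y3 = C(y2 + y4)
    if pulse == "X":
        return (0, y1, 0, y3)             # Y2 = X(y1), Y4 = X(y3)
    raise ValueError(f"unknown pulse {pulse!r}")


def apply_pulse(y, pulse):
    """Stage 2: the pulse returns to 0, T is enabled and y takes the value of Y."""
    return next_state_values(y, pulse)


def run(pulses, start=(1, 0, 0, 0)):
    """Apply a sequence of non-overlapping pulses and record each state."""
    y = start
    trace = [y]
    for pulse in pulses:
        y = apply_pulse(y, pulse)
        trace.append(y)
    return trace
```

Calling `run(["X", "C", "C"])` steps the sketch from S1 to S2, S3 and back to S1, which matches the predecessor relationships read off the equations.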

Liu[6] proposed a design technique for iterative logic array synchronous sequential
circuits that have the unique property where each state has predecessor states only in one
input. The next state equations have the form

$$Y_i = I_p(f_{ip})$$

where each fip is a sum-of-products with each product term consisting of a single
complemented or uncomplemented state variable. This technique reduces the amount of logic
further for each next state variable in that only one input Ip pass gate is needed. The
potential disadvantage is that more state variables can be needed.

3 Tolerance  to  1-1  Input  Overlap 

In the previous section there were no constraints on the width of each input pulse. (The
minimum width must be long enough to pass the signal to the output of the input inverters
at the first stage.) It was assumed that only one Ip would be 1. This condition can be
relaxed. For simplicity, suppose two inputs Ip and Iq are both 1. Moreover, suppose the
circuit should transition from Si to Sp or Sq under Ip or Iq respectively. When both Ip and
Iq are 1,

$$Y_i = I_p(f_{ip}) + I_q(f_{iq})$$

As long as Ip and Iq are 1, there can be conflicting signals at the input to the inverter
of the first stage of Fig. 1. Since at least one input = 1, transistor T is not enabled,
yi does not change, and the conflict does not affect the present state. The circuit remains
in state Si and will remain in Si until both Ip and Iq = 0. If Ip (Iq) remains 1 longer than
Iq (Ip), then fip (fiq) will be passed to specify Yi and only when both inputs are 0 will the
circuit transition to Sp (Sq). Therefore, the circuit action is determined by the input that
remains 1 last.

Theorem 2 If more than one input state is 1, then the next state of the circuit is
determined by the input that remains 1 last.

Proof: If more than one input = 1, then Yi is determined by the equation

$$Y_i = \sum_j I_j(f_{ij})$$

where Ij are those inputs that are 1. Since yi changes only when all Ij are 0, the circuit
does not transition until all Ij = 0. Suppose Ip is the last input that is 1. Then the
equation for Yi becomes

$$Y_i = I_p(f_{ip}).$$

When Ip transitions to 0, then yi assumes the state determined by Yi, which was specified
by Ip.

QED.
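Theorem 2 can be illustrated with a small event-driven sketch. It is a behavioural abstraction rather than a circuit simulation: the function name `settle`, the event format and the `target` mapping from each input to the state it selects are all assumptions made for this example.

```python
# Sketch of Theorem 2: the input that returns to 0 last selects the next state.
# `events` is a list of (input_name, level) changes in time order.
# `target` maps each input to the next state it would select (hypothetical).

def settle(events, target):
    active = set()        # inputs currently at 1
    candidate = None      # value held at point A
    for name, level in events:
        if level:
            active.add(name)
        else:
            active.discard(name)
        if len(active) == 1:
            (only,) = active
            candidate = target[only]   # a single pass network drives point A
        elif len(active) > 1:
            candidate = None           # conflicting values, T stays disabled
        # when no input is active, point A holds its last value
    if active:
        raise ValueError("inputs must all return to 0 before the state changes")
    return candidate
```

With three overlapping inputs where I2 falls last, the sketch returns the state selected by I2 regardless of the order in which the inputs rose.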

From Theorem 2, it is clear that the order in which inputs transition from 1 → 0 is
important. Transitions from 0 → 1 are unimportant. Therefore, if more than one input
state is 1, it is unimportant in which order the inputs transition 0 → 1. The next state is
specified by the last input that transitions 1 → 0. For example, suppose there are four
input  states  for  a circuit.  If  the  inputs  transition  as  shown  in  Fig.  2,  the  circuit  will  assume 
the  state  specified  by  1$  when  all  the  inputs  are  0. 


Figure 2: Input Waveform Example (inputs I1, I2, I3, I4)


4 Level  Input  Circuits 

The previous discussion focused on pulse mode circuits. Several researchers have
introduced the notion of transition sensitive asynchronous sequential circuit design [2,7].
Bredeson [2] converted a level input flow table to a transition sensitive (TS) flow table. A
TS flow table shows the table entries that result from a change in inputs. The essential
feature in a TS design is that inputs are represented as pulses which are created whenever
the input state transitions from 0 → 1. Consider the level input flow table of Table 2. The
TS representation of this flow table is shown in Table 3. Once the flow table is in the TS
form, the design procedure in Section 2 applies.

Bredeson introduced another notion in the design of TS circuits. If one begins with
a primitive row flow table, then the input state variables can become the state variables.
Additional state variables are needed only to produce unique codes for the states and this
is accomplished by partitioning stable states in each column of the flow table. In Table 3,
y1 and y2 are assigned to x1 and x2 respectively. State variable y3 is assigned to partition
the stable states in each column. For example y3 partitions states A and D in the first
column. Therefore only one additional state variable is needed to implement the flow table
rather than the expected three.

Table 2: Level Input Flow Table

Table 3: Transition Sensitive Flow Table

5 State  Sequence  Detection 

It might be desirable to be able to detect the potential transition between a pair of states
that might be associated with a critical event. Suppose state Sk can be entered only from
state Si under fault free conditions. If state Sk is entered from state Sn, i ≠ n, then an
error has occurred. In some cases, such a transition should not be allowed.

The  circuit  presented  here  is  capable  of  providing  information  necessary  to  detect  the 
occurrence  of  a transition  between  a pair  of  states  prior  to  the  actual  transition.  If  one 
knows that an undesirable transition is about to occur, it is possible to prevent the
transition and avoid an unwanted event.

State information is present at two points in the circuit of Fig. 1. The present state is
available at the output of the second stage, yi. When the next input state is 1, the next
state information is specified by Yi. To detect a sequence between a pair of states,
the state information at yi and Yi can be decoded.

If it was desired to permit a transition to state Sk only from state Si, then Si can
be decoded from yi and Sk can be decoded from Yi. If the next state as specified by Yi
is Sk and the present state as specified by yi is not Si, then an error condition can be
signaled. This is depicted in Fig. 3. To prevent the circuit from assuming state Sk under
the error condition, the error signal can be fed into the NOR gate which drives transistor
T in Fig. 1. Since transistor T is not enabled when the error condition is detected, the
circuit will not transition to Sk and will remain in its present state. It might be desirable
to stop all processing when the error condition is detected. If so, the error signal can be
used to disable all further input state changes and the circuit would remain in the current
state without any further state transitions. The present state variables yi will specify the
incorrect state Sj, Sj ≠ Si. If one desired to know the value of Sj, yi could be examined to
reveal the error state to help with diagnostics.
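The detection logic described above can be sketched as two small functions. The names `sequence_error` and `t_enable`, and the use of state labels in place of decoded state variables, are assumptions made for illustration; only the NOR gating of transistor T is taken from the text.

```python
# Sketch of state sequence detection (labels stand in for decoded codes).

def sequence_error(present, nxt, allowed_from, guarded):
    """Return 1 when `guarded` is about to be entered from a state other
    than `allowed_from`; `present` is decoded from yi, `nxt` from Yi."""
    return int(nxt == guarded and present != allowed_from)


def t_enable(inputs, error):
    """NOR of all input states and the error signal drives transistor T."""
    return int(not (any(inputs) or error))
```

For example, with Sk guarded and Si the only permitted predecessor, a pending transition from Sn raises the error signal, and `t_enable` stays at 0 even after every input has returned to 0.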


6 Fault  Detection 

Classical  fault  detection  of  sequential  faults  includes  using  an  error  detection  code  on  the 
state  assignment  [8].  If  hardware  is  not  shared,  a single  error  detection  code  is  sufficient. 
Since  the  design  approach  used  here  does  not  share  logic,  except  for  the  NOR  gate  which 
drives  the  T transistors,  a single  error  detection  code  can  be  employed  and  is  used  in  the 
work  presented  here.  It  is  assumed  that  the  NOR  gate  is  hard  core  for  this  discussion. 
Moreover,  it  is  assumed  that  only  one  device  can  fail  at  a time  and  that  the  circuit  will 
assume all total circuit states before a second fault can occur. In this discussion, all faults
that can cause a false next state value are detectable; this includes stuck-at, stuck-open
and stuck-on faults.

The  circuit  presented  thus  far  has  some  interesting  fault  detection  capabilities.  Most 
other  fault  detection  mechanisms  for  sequential  circuits  detect  the  presence  of  a fault  after 
the  circuit  has  assumed  a faulty  state.  This  circuit  is  able  to  detect  the  presence  of  a fault 
in  most  of  the  circuit  before  the  circuit  actually  enters  the  fault  state. 

In this discussion, it is assumed that a simple parity code is used for fault detection.
Under the single fault assumptions above, only one extra state variable needs to be added.
Let the states of the flow table be encoded with an even parity state assignment. Whenever
odd parity is assumed by the state variables, a fault condition is detectable. Let all odd
parity states (fault states) be assigned to have a next state value that is also odd parity.
Therefore, whenever an odd parity state is assumed, the next state is also an odd parity
state.

The circuit for fault detection is shown in Fig. 4. The fault detector simply detects
the presence of odd parity on the state assignment; f is assigned to equal 1 when an odd
parity state is present. The fault detector monitors the parity of Yi. If a fault occurs in
any of the circuitry that produces Yi, f will detect its presence. With a fault, f = 1, and
since f feeds into the NOR gate, the T transistor is not enabled and the fault state cannot
be assumed by yi. In this case, the circuit does not enter the fault state. Moreover, if the
input states can be disabled, the circuit will remain in the current state.
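A minimal sketch of the parity check follows. The helper names are hypothetical, and the code models only the logic described in the text: an even parity state assignment, a detector output f that is 1 on odd parity, and a NOR gate that keeps T disabled while f = 1.

```python
from functools import reduce
from operator import xor

# Sketch of the parity-based fault detector (behavioural, names assumed).

def with_parity(bits):
    """Append one state variable so that the code has even parity."""
    return tuple(bits) + (reduce(xor, bits, 0),)


def fault_signal(next_state_bits):
    """f = 1 when the next state variables Yi have odd parity."""
    return reduce(xor, next_state_bits, 0)


def t_enable(inputs, f):
    """NOR of the input states and f; T is enabled only when all are 0."""
    return int(not (any(inputs) or f))
```

A single flipped bit in an even parity code produces f = 1, so `t_enable` remains 0 and the faulty value is never passed to the second stage.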

Signal  f will  be  driven  towards  a 1 value  as  the  circuit  transitions  between  unstable  and 
stable  states.  Signal  f then  would  prevent  the  T transistor  from  being  enabled,  but  this 
actually  helps  the  circuit  not  enter  an  improper  state.  Signal  f can  be  used  therefore  to 
produce  a self  synchronizing  signal,  but  this  is  a subject  beyond  the  scope  of  this  paper. 

If a fault occurs in the second stage after the T transistor, then an odd parity state will
be entered. The next state value as specified by the fip terms will be odd parity also since
it is assumed that only one fault is present. If the fip terms generate odd parity, then Yi
will also have odd parity and then f = 1 with the fault being detected.

A fault in a T transistor will have the same impact as a fault in the second stage. If
yi assumes the correct value in spite of a faulty T, no error is detected and the circuit
operates as designed. Only when yi assumes an incorrect value will an odd parity state be
entered and hence detected.


7 Summary

A fundamental  logic  circuit  has  been  presented  that  will  allow  for  efficient  implementation 
of  pulse  mode  asynchronous  sequential  circuits.  Level  input  flow  tables  can  be  transformed 
into  transition  sensitive  flow  tables  which  can  be  directly  implemented  with  the  circuit 
presented  here.  The  resulting  circuits  are  tolerant  of  1-1  crossover  conditions.  The  final 
next  state  of  the  circuit  is  determined  by  the  last  input  that  is  1 whenever  more  than  one 
input  state  is  1. 

The unique characteristic of state sequence detection can be achieved with this circuit.
It is very easy to detect the present and next state in the circuitry and to prevent next
state transitions from occurring. In addition to state sequence detection, real time fault
detection can be achieved where a fault state can be detected prior to a transition to a
fault state. This fault detection capability covers a wide range of fault conditions and
possible faults in the circuit.

Figure 4: Fault Detection Logic

Abstract - This paper introduces an improved method for designing the class
of  CMOS  VLSI  asynchronous  sequential  circuits  introduced  in  the  paper  by 
Sterling  R.  Whitaker  and  Gary  K.  Maki,  “Self  Arbitrated  VLSI  Asynchronous 
Circuits.” 


1 Introduction 

Synchronous  sequential  circuits  are  often  the  first  choice  in  VLSI  design.  Races  are  avoided 
by  synchronizing  the  circuit  with  a common  clock  signal;  however,  the  frequency  of  this 
signal must be slow enough to allow signals to propagate through the slowest block
regardless of how often that block’s output is actually used. Also, the RC delays introduced
in VLSI design make synchronizing the circuit with a clock signal increasingly difficult
as the circuit’s complexity increases. Moreover, with CMOS circuits peak power usage is
attained  during  switching.  If  several  blocks  are  synchronized  by  a clock  signal  and,  thus, 
switching  at  the  same  time  then  the  peak  power  required  by  the  chip  is  greatly  increased. 

These  limitations  can  be  avoided  by  designing  with  asynchronous  circuits.  The  paper 
by S. Whitaker, “Self Arbitrated VLSI Asynchronous Circuits”, presents an asynchronous
circuit  with  some  interesting  qualities.  Of  main  interest  here  is  the  simple  design  by 
inspection  rules  that  arise  from  this  circuit.  This  paper  presents  a variation  on  Whitaker’s 
circuit  which  reduces  the  number  of  transistors  required. 

2 Circuit  Model 

The  general  model  for  this  circuit  is  the  same  as  that  given  in  Whitaker’s  research.  There 
is  an  enable  and  disable  block  feeding  into  a buffer  stage  as  shown  in  Figure  1. 


Figure 1: General Circuit Model

where yi and Yi are present and next state variables respectively, and Ij represents
input signals.

The  variation  on  the  original  circuit  presented  here  differs  in  how  the  buffer,  enable, 
and disable blocks are implemented. Whitaker’s buffer circuit consisted of two inverters
and  two  weak  feedback  transistors  as  shown  in  Figure  2: 


Figure  2:  Original  Buffer  Circuit 

As shown, this buffer circuit provides not only Yi but also its complement. The buffer state
table for this circuit is given in Table 1.

Table 1: Original Buffer State Table

yi   Input   Yi
0    0       0
0    1       1
1    0       0
1    1       1
0    z       0
1    z       1

The buffer for the circuit presented in this paper is shown in Figure 3. Although it
only produces the Yi variable and not its complement, it saves two transistors, and in the
design procedure for this circuit the complemented variable is not needed. The buffer state
table for this circuit is given in Table 2.

Figure 3: New Buffer Circuit

The  enable  and  disable  blocks  are  simple  pass  networks  and  their  design  is  completely 
specified  by  the  design  equations  given  later  in  this  paper. 

Table 2: New Buffer State Table

yi   Input   Yi
0    0       1
0    1       0
1    0       1
1    1       0
0    z       0
1    z       1
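The two buffer state tables can be written as small functions, which makes the difference between them easy to see: the original buffer follows a driven input, while the new buffer inverts it, and both hold the present value when the input is tristated. The function names and the use of the string "z" for the high impedance condition are assumptions made for this sketch.

```python
# Sketch of the buffer state tables; Z marks a tristated (undriven) input.
Z = "z"


def original_buffer(y, inp):
    """Table 1: a driven input is passed through, otherwise y is held."""
    return y if inp == Z else inp


def new_buffer(y, inp):
    """Table 2: a driven input is inverted, otherwise y is held."""
    return y if inp == Z else 1 - inp
```

Under this reading, passing a 0 to the new buffer sets Yi high and passing a 1 sets it low, which is consistent with the yk(0) enable terms and yl(1) disable terms used in the design equations.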

3 State  Assignment 

The state assignment for a flow table remains the same in this paper as in its predecessor,
a simple one-hot-code state assignment as shown in Table 3.

Table 3: Example Flow Table

        I1    I2    I3     ya yb yc yd ye yf
A      (A)    E     F      1  0  0  0  0  0
B       A    (B)    F      0  1  0  0  0  0
C       A     B    (C)     0  0  1  0  0  0
D      (D)    B     C      0  0  0  1  0  0
E       D    (E)    C      0  0  0  0  1  0
F       D     E    (F)     0  0  0  0  0  1

Stable states are shown in parentheses.

These  circuits  effect  a non-normal  transition.  Thus,  in  order  to  show  that  the  circuit 
operates  correctly,  it  is  necessary  to  show  that  no  transition  path  between  two  states 
overlaps  any  transition  path  between  two  other  states. 

Definition 1: Let [yi] be the state with bit yi set and let [ya yb yc ...] be the binary
number with the appropriate number of bits with the ya, yb, yc, ... bits set.

Definition 2: By definition (to be shown later) let a proper transition path between
[yi] and [yj] be [yi yj]. Note that from the one-hot-code state assignment every state can
be expressed by [yk], where yk represents the one bit which should be set for any state.

Theorem 1: Given a one-hot-code state assignment with the above “proper” transition
path, no two transition paths will overlap.

Proof: Since the transition path for two states [yi] and [yj] is given by [yi yj], this will
overlap with the transition between [ya] and [yb], given by [ya yb], only if ya = yi and
yb = yj; thus, the transition between two states only overlaps itself.
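Theorem 1 can be checked by brute force for a small number of states. In this sketch the one-hot codes are stored as bit masks; the function names and the choice of the leftmost bit for the first state are assumptions made for illustration.

```python
from itertools import combinations

# Brute-force check that one-hot transition paths [yi yj] never coincide.

def one_hot(i, n):
    """Code of state i (0-based) among n states, leftmost bit first."""
    return 1 << (n - 1 - i)


def path(i, j, n):
    """Transition path [yi yj]: both bits set at the same time."""
    return one_hot(i, n) | one_hot(j, n)


def paths_are_distinct(n):
    """True if every pair of states has its own path code and no path
    code equals the code of a stable state."""
    codes = [path(i, j, n) for i, j in combinations(range(n), 2)]
    stable = {one_hot(i, n) for i in range(n)}
    return len(set(codes)) == len(codes) and not (set(codes) & stable)
```

For the six-state example, `paths_are_distinct(6)` evaluates to True, and the path between the first two states is the code 110000.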

Therefore,  to  show  that  the  circuit  presented  here  correctly  implements  a flow  table  it 
must  simply  be  shown  that  it  correctly  implements  the  “proper”  transition  path  referred 
to  above. 

4 Design  Procedure 

This  section  starts  with  a definition. 

Definition 3: Let a scale-of-two loop in a flow table be defined by any state A
under input Ik which goes to state B under input Ij where state B returns to state A under
input Ik.

The basic design equations for a circuit can be expressed as:

$$Y_i = \underbrace{\sum_j I_j\Big(\sum_k y_k(0)\Big)}_{\text{Enable Expr.}} + \underbrace{\sum_l y_l(1)}_{\text{Disable Expr.}} \qquad (1)$$

Note that the disable expression form only holds in the absence of scale-of-two loops.

Looking at an example flow table as given in Table 3 and reproduced here for convenience:

Table 3: Example Flow Table

        I1    I2    I3     ya yb yc yd ye yf
A      (A)    E     F      1  0  0  0  0  0
B       A    (B)    F      0  1  0  0  0  0
C       A     B    (C)     0  0  1  0  0  0
D      (D)    B     C      0  0  0  1  0  0
E       D    (E)    C      0  0  0  0  1  0
F       D     E    (F)     0  0  0  0  0  1

Stable states are shown in parentheses.

For  each  state  the  design  procedure  is  simple: 

For state [yi]:

1. Identify all input states Ij under which [yi] is stable. These input states become the
Ij’s in the basic equation.

2. For each Ij identify all unstable states [yi] in the same column and note the stable
state’s row [yk] under which they occur. Again, these [yk]’s become the yk’s in the
basic equation under the appropriate Ij’s.

3. Thus, the enable equation can be written as:

$$\sum_j I_j\Big(\sum_k y_k(0)\Big) \qquad (2)$$

4. Identify all unstable states [yl] under the row for the stable state [yi].

5. The disable expressions for the state [yi] can be written as:

$$\sum_l y_l(1) \qquad (3)$$

6. For each term in the disable expression determine if it is a member of a scale-of-two
loop. If so, include the input state under which this term occurs as part of the term.
For example, if yl(1) was under Ik and was a member of a scale-of-two loop, then
yl(1) would become Ik yl(1).
As an example, the enable expression for state A, [ya], would be:

$$I_1(y_b(0) + y_c(0)) \qquad (4)$$

Figure 4: Example Enable Block

And,  the  disable  expression  for  state  A would  be: 


5 Circuit  Operation 

In order to show that this circuit operates correctly, it must be shown that the circuit
transitions properly. To do this it must be shown that in going from state [yi] to [yj] under
input Ij the transition path is [yi yj].

Theorem  2:  Only  positive  and  not  complemented  variables  are  used  in  the  design 
equations  for  this  circuit. 

Proof:  This  follows  directly  from  the  design  procedure  and  the  basic  design  equation. 

Theorem 3: The enable and disable equations for [yi] do not contain the present state
term yi.

Proof: The design equations for state [yi] derive their terms from the unstable states
in row [yi] and the stable states in the other rows which contain unstable [yi] states. Since
every state [yi] in the row for the state [yi] is stable and every state [yi] not in the row for
state [yi] is unstable, the yi term never appears in the design equations for state [yi].

Theorem 4: For a reduced row flow table, each unstable state [yu] under input Iu
on the stable state [ys]’s row will contribute only the following terms to
the design equations for that table:

yu(1) or Iu yu(1)    disable term for Ys
Iu ys(0)             enable term for Yu

Proof: The proof follows directly from the design procedure.

Theorem 5: For a reduced row flow table that transitions from state [yi] to the state
[yk] under input Ij, while the circuit is in state [yi] under input Ij the only variable to
be affected is Yk, which is set high.

Proof: If this theorem were not true then either 1) Yi would be set to 0 from this
state, or 2) Yk would not be changed, or 3) some other variable (all of which are zero at
this point due to the one-hot-code state assignment) would be raised to a 1.

First, for Yi to be set to zero with all other state variables at zero, the design
equation would have to contain the term yi(1), which is invalid from Theorem 3, or a
complemented term passing a 1, which is invalid from Theorem 2.

Second, Yk will be set high from Theorem 4 and the design procedure, which state that
the enable expression for Yk will include Ij yi(0). Note that a conflict could occur if the
design equation for Yk contained yi(1); however, this would be a scale-of-two loop and the
yi(1) term would be Ik yi(1) where Ik is different from Ij. Thus, there would be no conflict
for a flow table properly designed and Yk will be set high.

Finally, for another variable [ym] to be set high from the state [yi] under Ij, its
enable equation would have to contain the term Ij yi(0), which would, from the design
procedure, mean that under the row for state [yi] under Ij would be state [ym]; however,
by definition this location contains [yk].

Thus, the theorem must be true since all other alternatives are false.

Theorem 6: If the circuit is in the state [yi yk] under input Ij where state [yi] goes to
state [yk] under input Ij then the only variable to be affected is Yi, which is set low.

Proof: If this theorem were not true then either 1) Yi would stay high, or 2) Yk would
be set low, or 3) some other variable would be set high (since all the other variables are,
by definition, low to begin with).

First, from Theorem 4 and the design procedure the design equations for Yi contain
the term yk(1) or Ij yk(1); thus, Yi would fail to go low only if its equations also included
Ij yi(0), which is invalidated by Theorem 3, or a complemented term passing a 0 under Ij,
which is invalidated by Theorem 2, or Ij yk(0), which can be invalidated by the following
argument. If the enable expression for Yi contained any Ij terms then the position under Ij
would have to contain [yi] and, by definition, it contains [yk]. Thus, Yi is set low.

Second, for Yk to be set low its equations would have to contain the term yk(1) or
Ij yk(1), which can be invalidated from Theorem 3, or a complemented term passing a 1,
which can be invalidated by Theorem 2, or yi(1), which can be invalidated by the following
argument. If Yk did contain yi(1) then that would be a scale-of-two loop and the yi(1) term
would have to be Im yi(1) where Im is not equal to Ij; thus, Yk will not be set low.

Finally, for another variable [ym] to be set high from the state [yi yk] under Ij, the
enable equation would have to contain either the term Ij yi(0) or Ij yk(0). The term Ij yi(0)
would, from the design procedure, mean that under the row for state [yi] under Ij would
be state [ym], whereas by definition that location contains [yk]. The term Ij yk(0) would,
from the design procedure, mean that under the row for state [yk] under Ij would be state
[ym], whereas by definition that location contains the stable state [yk]. Thus, no other
variable will be set high.

Therefore, since all the other alternatives are proven false, this theorem must be true.

So  from  Theorems  5 and  6 the  circuit  created  from  the  design  procedure  in  the  previous 
section  fulfills  Theorem  1 and  is  critical-race  free. 


6 Transistor  Count  Comparison 

From the assignment procedure it is clear that we need one state variable for each
state. Also, from Figure 1, which shows the general circuit model for this circuit, it is
clear that for each state variable we need four transistors (one buffer). Moreover, from
Theorem 4 and the design procedure it can be shown that for each stable state [yi] in a
flow table one transistor is required if an unstable state [yi] is also in that column, and for
every unstable state two transistors are required. Also, every scale-of-two loop contributes
an additional two transistors. Thus, the number of transistors needed for a reduced row
flow table is given by:


Total # of Transistors = 4S + B + 2U + 2L          (6)

where S = number of states, B = number of stable states in the flow table with identical
unstable states in the same column, U = number of unstable states in the flow table, and L
= number of scale-of-two loops that exist in the flow table. The number of transistors for
the design method in Whitaker's paper can be shown to be:

Total # of Transistors = 6S + 4U + 2L              (7)

Thus, for the example flow table given in Table 3, where S=6, B=6, U=12, and L=0, the
total number of transistors for the improved design method is 4(6) + 6 + 2(12) = 54, and the
total number for the old method is 6(6) + 4(12) = 84, a difference of 30 transistors.
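
The two counts can be checked with a short sketch. It assumes that equation (6) reads 4S + B + 2U + 2L and equation (7) reads 6S + 4U + 2L, which is the reading that reproduces the 54 and 84 of the worked example; the function names are hypothetical, and the parameter rows are the (S, B, U, L) values listed in Table 4.

```python
def new_method_transistors(s, b, u, l):
    """Transistor count of the improved method, assuming eq. (6): 4S + B + 2U + 2L."""
    return 4 * s + b + 2 * u + 2 * l


def old_method_transistors(s, u, l):
    """Transistor count of the earlier method, assuming eq. (7): 6S + 4U + 2L."""
    return 6 * s + 4 * u + 2 * l


# (S, B, U, L) rows as listed in Table 4 of the paper.
TABLE_4_ROWS = [(6, 6, 12, 0), (8, 9, 15, 1), (7, 10, 18, 2), (9, 10, 17, 0)]


def compare(rows):
    """Return (old count, new count, new/old ratio rounded to 3 places) per row."""
    result = []
    for s, b, u, l in rows:
        new = new_method_transistors(s, b, u, l)
        old = old_method_transistors(s, u, l)
        result.append((old, new, round(new / old, 3)))
    return result


if __name__ == "__main__":
    for row in compare(TABLE_4_ROWS):
        print(row)
```

For every row the ratio stays close to two thirds, which matches the "roughly one third" saving claimed in the summary.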

Table  4 shows  some  typical  values  for  reduced  row  flow  tables  and  compares  transistor 
counts. 


Table 4: Transistor Count Comparison

                          Transistor Count
  S     B     U     L   Old Method   New Method   New/Old
  6     6    12     0       84           54        0.643
  8     9    15     1      110           73        0.664
  7    10    18     2      118           78        0.661
  9    10    17     0      122           80        0.656

7 Reset Feature


In  any  circuit,  asynchronous  or  synchronous,  it  is  advantageous  to  be  able  to  preset  the 
circuit  to  some  state.  This  is  almost  a necessity  upon  startup  since  a circuit  often  begins 
in  an  unknown  or  undesirable  state.  The  basic  model  for  this  circuit  can  be  modified  to 
easily  include  a reset  feature  as  shown  in  figure  six: 


Figure  6:  Modified  General  Circuit  Model 

This  feature  only  costs  one  transistor  for  each  state;  however,  the  design  must  insure 
that  all  inputs  are  low  while  the  reset  is  high. 


8 Summary

The  general  circuit  model  for  this  paper  and  the  one-hot-code  state  assignment  lead  to 
an  easily  designed  and  implemented  asynchronous  circuit.  Once  an  input  is  introduced 
the  circuit  sets  the  new  variable  high  which  then  sets  the  variable  signifying  the  old  state 
low. This [yi] → [yi yj] → [yj] non-normal mode transition ensures that no two transition paths
overlap; thus, the circuit is critical-race free.
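
The two-step transition can be traced with a minimal sketch. The flow table and the function name transition_path below are hypothetical and serve only as illustration; the sketch assumes that every unstable entry leads directly to a state that is stable under the same input, and it models only the order in which the one-hot state variables change, not the circuit itself.

```python
def transition_path(flow_table, state, inp):
    """Sequence of sets of high state variables visited for one input change.

    flow_table[state][inp] is the next state. A stable entry maps a state
    to itself. For an unstable entry the new variable is set high first,
    then the old variable is set low: [yi] -> [yi yj] -> [yj].
    """
    nxt = flow_table[state][inp]
    if nxt == state:
        return [frozenset({state})]
    return [frozenset({state}), frozenset({state, nxt}), frozenset({nxt})]


# Hypothetical two-state flow table used only for illustration.
EXAMPLE_FLOW_TABLE = {
    "a": {"I1": "a", "I2": "b"},
    "b": {"I1": "a", "I2": "b"},
}

if __name__ == "__main__":
    for step in transition_path(EXAMPLE_FLOW_TABLE, "a", "I2"):
        print(sorted(step))
```

The intermediate set always contains exactly the old and the new variable, which is the property Theorems 5 and 6 rely on.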

This  improved  design  method  reduces  the  transistor  count  from  the  old  method  by, 
roughly,  one  third,  decreasing  the  size  of  the  overall  circuit  and  increasing  its  usefulness. 

Finally,  although  the  circuit  operates  at  1/2  the  speed  of  an  STT  state  assignment 
asynchronous  circuit  due  to  its  non-normal  mode  operation  it  is  very  easy  to  design  and 
avoids  the  disadvantages  of  a synchronous  circuit  such  as  clock  routing,  power  bussing, 
and  speed  dependency  upon  slowest  information  path. 
Abstract- Design of general/special purpose supercomputing VLSI systems for
numeric algorithm execution involves tackling two important aspects, namely
their computational and communication complexities. Development of software
tools for designing such systems itself becomes complex. Hence a novel
design methodology has to be developed. For designing such complex systems
a special purpose silicon compiler is needed in which

1. The computational and communicational structures of different numeric
algorithms should be taken into account to simplify the silicon compiler
design.

2. The approach is macrocell based.

3. The software tools at different levels, from algorithm down to the VLSI circuit
layout, should get integrated.


In this paper a special purpose silicon (SPS) compiler based on PACUBE
macrocell VLSI arrays [1] for designing supercomputing VLSI systems is
presented. It is shown that turn-around-time and silicon real estate get reduced
over the silicon compilers based on PLAs, SLAs and gate arrays.

Characteristics  1 and  2 above  enable  the  SPS  compiler  to  perform  systolic 
mapping  (at  the  macrocell  level)  of  algorithms  whose  computational  structures 
are  of  GIPOP  (Generalized  Inner  Product  Outer  Product)  form  [2].  Direct 
systolic  mapping  on  PLAs,  SLAs  and  gate  arrays  is  very  difficult  as  they  are 
micro-cell  based.  A novel  GIPOP  processor  is  under  development  using  this 
special  purpose  silicon  compiler. 


* pursuing their higher studies/employed in India or abroad. Contact for communication regarding this
paper: N. Venkateswaran, Additional Professor, Dept. of Computer Science and Engineering, Sri Venkateswara
College of Engineering, University of Madras, India. Authors' names listed randomly.




1 Introduction 

No silicon compiler has yet been developed exclusively for tackling the complexity in
designing supercomputing VLSI systems. In developing such a compiler, besides achieving
reduced turn-around-time and area, the computational performance of mapped functional units
and architectural characteristics like systolic mapping should also be considered. The latter
two factors are not taken care of in micro-cell based silicon compilers.

In  conventional  compilers  the  software  tools  are  not  integrated  from  algorithm  down  to 
the  VLSI  circuit  level.  The  integration  of  software  tools  at  different  levels  can  be  achieved 
by  adopting  novel  methodologies  in  designing  supercomputing  VLSI  systems  (processors 
and  arrays)  and  developing  novel  algorithm  mapping  techniques.  This  integration  reduces 
the  software  complexity  to  a great  extent. 

The turn-around-time is high if gate arrays, PLAs and SLAs are employed for
supercomputing system synthesis. Further, the silicon compilation should take place at a much
higher level than at the gate or the micro-cell level for supercomputing system design.

The PACUBE macrocell structure is a combination of PLAs, SLAs, gate arrays and
standard cells. Besides storage and logic elements, an important computing unit, the DRAA
(Dynamically Reconfigurable Array Adder [1]), is also present (Fig. 1). The presence of the
DRAA drastically reduces the turn-around-time and silicon area for designing special
purpose VLSI systems. The DRAA helps in achieving high performance due to its functional
and architectural characteristics. If the DRAA were to be mapped on to PLAs, SLAs
and gate arrays, the performance would be degraded. The special purpose silicon compiler
built for mapping supercomputing systems on the PACUBE arrays achieves all the above
factors.


2 PACUBE  Macrocell  Array  And  Systolic  Mapping 
Tools. 

2.1  Unified  GIPOP  Operations  On  Macrocell  Arrays 

Execution of a number of numeric algorithms involves inner-product operations. Several
VLSI systems have been proposed for executing numeric algorithms whose computational
structures are of inner-product form. For this purpose VLSI arrays of inner-product step
processors have been employed conventionally.

In general the computational structures of numeric algorithms are complex. However,
on a closer study it is noted that these structures can be brought under a generalized form
called Generalized Inner-Product Outer-Product (GIPOP) functions. These are


$$(A_i * B_i) \qquad (1)$$

$$\frac{(A_i + B_i)}{C_i} \qquad (2)$$


Figure 1: PA^3 (ASLA) Logic Model


The computational structures of numeric algorithms may involve a mathematical
combination of the above two equations. Evaluation of GIPOP function based computations
involves inner-product operations, chain multiplications, outer-product operations and
reciprocal operations.

It is shown in this paper that by using PACUBE VLSI arrays [1] these GIPOP
functions can be evaluated as a sum of equivalent weighted inner-product functions only. Also
massive parallelism can be employed in GIPOP operations. This unification of the
execution processes of GIPOP functions only in terms of equivalent inner-product operations is
achieved by establishing identical inter and intra macrocell connections (data flow
mapping) on PACUBE arrays for chain multiplication and inner-product operations. Both
the outer-product and reciprocal operations can be expressed in terms of a sum of chain
multiplication operations [3].
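
As an illustration of how a reciprocal can be built from multiplications alone, the sketch below uses the standard identity 1/(1 - x) = (1 + x)(1 + x^2)(1 + x^4)... for |x| < 1. This is an assumption chosen for illustration; the source does not describe the multiplicative division algorithm of [3], so the sketch should not be read as that method. The function name and the choice of six factors are hypothetical.

```python
def reciprocal_by_chain_product(d, factors=6):
    """Approximate 1/d for 0.5 <= d < 1 as a chain of multiplications.

    With x = 1 - d, the product (1 + x)(1 + x^2)(1 + x^4)... converges to
    1/(1 - x) = 1/d. Each extra factor squares the remaining error term.
    """
    if not 0.5 <= d < 1.0:
        raise ValueError("d must be normalised to the range [0.5, 1)")
    x = 1.0 - d
    result = 1.0
    for _ in range(factors):
        result *= 1.0 + x   # one more term of the chain product
        x *= x              # x, x^2, x^4, ...
    return result


if __name__ == "__main__":
    print(reciprocal_by_chain_product(0.75))
```

Because every step is a multiplication (plus an increment), a unit that executes chain multiplications can in principle also serve reciprocal evaluation, which is the kind of reuse the paragraph above describes.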

Let the inner-product, chain multiplication, outer-product and reciprocal operations be
defined as follows.

$$\sum_{i=1}^{T_t} \{ {}_{w}(S_i)_{p} * {}_{w}(K_i)_{p} \} = I\{ {}_{w}(S_i)_{p},\ {}_{w}(K_i)_{p},\ T_t \} \qquad (3)$$

$$\prod_{i=1}^{T_t} \{ {}_{w}(S_i)_{p} + {}_{w}(K_i)_{p} \} = O\{ {}_{w}(S_i)_{p},\ {}_{w}(K_i)_{p},\ T_t \} \qquad (4)$$

$$\prod_{i=1}^{T_t} \{ {}_{w}(S_i)_{p} \} = C\{ {}_{w}(S_i)_{p},\ T_t \} \qquad (5)$$

$$\frac{1}{{}_{w}(S_i)_{p}} = R\{ {}_{w}(S_i)_{p} \} \qquad (6)$$

S_1, S_2, ..., S_i and K_1, K_2, ..., K_i are the operands of word length p bits.
w - The weight of the MSB of the word length.
T_t - Number of inner-product terms (operand pairs) or
number of chain multiplication terms (operands).

$${}_{w}(S_i)_{p} = \{ {}_{w_n}(S_i)_{p_n},\ {}_{w_{n-1}}(S_i)_{p_{n-1}},\ \ldots,\ {}_{w_1}(S_i)_{p_1} \} \qquad (7)$$

$${}_{w}(K_i)_{p} = \{ {}_{w_n}(K_i)_{p_n},\ {}_{w_{n-1}}(K_i)_{p_{n-1}},\ \ldots,\ {}_{w_1}(K_i)_{p_1} \} \qquad (8)$$

p_i - Word length of the partition i, i = 1 to n
w_i - Weight of the MSB of the partition (p_1 + p_2 + ... + p_i)

Using the relationships (7) and (8) we get

$$p = p_1 + p_2 + \cdots + p_n = \sum_{i=1}^{n} p_i, \qquad w = w_1 + w_2 + \cdots + w_n = \sum_{i=1}^{n} w_i \qquad (9)$$

(* defines binary multiplication)

$${}_{w}(S_i)_{p} * {}_{w}(K_i)_{p} = {}_{w+w}(Q_i)_{p+p} \qquad (10)$$

2.2  Inner-product  operations  on  PACUBE  Arrays 

Execution of inner-product operations on PACUBE arrays has been dealt with in [1]. To achieve
massive parallelism in evaluating inner-product functions, partial product arrays (PPAs)
of the different product terms

(A1 * B1, A2 * B2, ..., An * Bn)

are obtained in parallel (forming a massive array) and added simultaneously [1]. Refer
to Fig. 2a. There are three different ways of massive array formation and reduction [2].
They are called the MAR1, MAR2 and MAR3 processes. Figure 2a corresponds to the function

I{3(Ai)4, 3(Bi)4, 4}

The reduction processes corresponding to MAR2 and MAR3 are similar to the
MAR1 reduction process presented in [1].
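
The idea of forming one massive array from the partial product arrays of all product terms and reducing it in a single summation can be sketched functionally as follows. This is a software illustration under stated assumptions, not the MAR reduction process of [1]: operands are taken to be unsigned p-bit integers, the function names are hypothetical, and the simultaneous addition performed by the hardware is modelled by one call to sum.

```python
def partial_product_array(a, b, p):
    """Rows of the partial product array of a * b for unsigned p-bit operands.

    Row k is a shifted left by k if bit k of b is set, otherwise zero.
    """
    return [(a << k) if (b >> k) & 1 else 0 for k in range(p)]


def inner_product_massive_array(pairs, p):
    """Stack the PPAs of all operand pairs and add every row in one pass."""
    massive = []
    for a, b in pairs:
        massive.extend(partial_product_array(a, b, p))
    return sum(massive), massive


if __name__ == "__main__":
    # Four pairs of 4-bit operands, matching the term count of I{3(Ai)4, 3(Bi)4, 4}.
    pairs = [(3, 5), (7, 9), (15, 15), (2, 4)]
    total, rows = inner_product_massive_array(pairs, 4)
    print(total, len(rows))
```

For four pairs of 4-bit operands the massive array has sixteen rows, and its sum equals the conventional inner product of the operand pairs.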

The MAR process should be chosen such that the following important criteria are taken
care of:

1. The injection of the operand [4,4] sub-matrices of the massive array into the PACUBE
array should be simple.

2. The partial sum bits output corresponding to the sum output of the massive array
should occur in consecutive cycles of the array reduction process.

3. The partial sum output should be of the same length in all cycles.

4. The partial sum output should occur only in the peripheral macrocells.

5. The intra macrocell data flow should be regular, i.e. there is no multiplexing of data
between macrocells.

6. The number of array reduction cycles and the number of macrocells should be minimum.

2.3  Chain  Multiplication  On  PACUBE  Arrays 

The application of the PACUBE array can be extended to the chain multiplication operation.

Consider the massive array formation for

C{3(Ai)4, 3}

Similar to the inner-product operation, three different types of massive array can be formed
for executing chain multiplication. Refer to Fig. 2b. Only the MAR2 massive array formation
is shown.

2.4 Unification Of Chain Multiplication And Inner Product Operations On PACUBE Arrays

There is a striking similarity between the array formations corresponding to the inner product
and chain multiplication operations. The only difference is in the array sizes. The massive
array of C{3(Ai)4, 3} is larger than that of I{3(Ai)4, 3(Bi)4, 4}. In this example the
difference in array sizes is not much. In general the massive arrays corresponding to the inner
product and chain multiplication operations can be made of comparable sizes by a proper
choice of the word length and number of terms. Hence an identical array reduction process
having the same inter and intra macrocell data flow can be established on the PACUBE
macrocell arrays. The operand injection points differ for these operations. Hence structurally
these two operations are equivalent. Such equivalent pairs may not exist for certain values
of word length and number of terms. In some cases, even if the equivalent pairs exist, the
word lengths of the pairs may have odd values. The word length of the equivalent pairs
should have values in powers of 2. It is preferable to have equal word length for such
equivalent pairs. The number of terms (problem size) can be adjusted to achieve
this.

For example, in the equivalent pair I{8(Si)8, 8(Ki)8, 8} and C{8(Si)8, 3} the word
lengths are the same but the numbers of terms are different. The equivalent pair
I{16(Si)16, 16(Ki)16, 4}, C{16(S1)16, 8(S2)8, 8(S3)8, 3} has different word length and
problem size. But C{16(S1)16, 16(S2)16, 16(S3)16, 3} can be easily decomposed (by proper word
length and term/problem size partitioning) in terms of C{16(S1)16, 8(S2)8, 8(S3)8, 3}. That
is, C{16(Si)16, Tt} can be decomposed to I{16(Si)16, 16(Ki)16, 4}. Further details on this are
dealt with in section 3.3.

Figure 2: MAR process-2 array formation (panel labels: numeric algorithm mapping concept; GIPOP processor; IP blocks)

This  leads  to  a unified  PACUBE  VLSI  array  for  executing  Inner  product  and  Chain 
multiplication  operations.  Outer  product  operation  and  high  speed  multiplicative  division 
algorithm  [3]  are  based  on  chain  multiplication  operation.  Hence  the  execution  processes 
of  GIPOP  functions  can  be  unified  on  the  PACUBE  macrocell  arrays. 
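
Word length partitioning, which lets a wide multiplication be evaluated as a weighted sum of narrower products, can be illustrated with a minimal sketch. It assumes unsigned operands that fit in the total partitioned width; the function names and the partition widths in the usage lines are hypothetical, and the sketch shows only the arithmetic identity, not the mapping onto PACUBE macrocells.

```python
def split_operand(value, widths):
    """Split an unsigned operand into partitions, least significant first.

    Returns a list of (bit offset, chunk) pairs, one per partition width.
    Bits above the total width are ignored.
    """
    parts = []
    offset = 0
    for width in widths:
        parts.append((offset, (value >> offset) & ((1 << width) - 1)))
        offset += width
    return parts


def partitioned_product(s, k, widths):
    """Multiply s by k as a weighted sum of products of their partitions.

    Each term multiplies one partition of s by one partition of k and is
    weighted by 2 raised to the sum of the two partition offsets.
    """
    total = 0
    for off_s, chunk_s in split_operand(s, widths):
        for off_k, chunk_k in split_operand(k, widths):
            total += (chunk_s * chunk_k) << (off_s + off_k)
    return total


if __name__ == "__main__":
    # A 16-bit product evaluated from 8-bit partitions.
    print(partitioned_product(0xBEEF, 0x1234, [8, 8]) == 0xBEEF * 0x1234)
```

Every term of the double loop is a small product with a fixed weight, so the whole multiplication has the shape of a weighted inner product over operand partitions.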


2.5  Systolic  Mapping  Of  GIPOP  Functions 

An algorithm has been developed for automatic systolic mapping of GIPOP function
execution on the P-arrays and implemented under DOS (Fig. 3). The unification of the
execution processes of the GIPOP functions has greatly simplified the development of
systolic mapping tools.

Figure 3: Systolic mapping of GIPOP functions on Pacube arrays

3 GIPOP  Processor  Array  And  Software  Tools  For 
Algorithm  Mapping 

3.1  GIPOP  Processor 

Existing processor arrays meant for supercomputing are classified into special purpose
and programmable general purpose arrays. An attempt to combine the advantages of
these approaches has culminated in a novel processor design approach, namely GIPOP
processor arrays. This novel approach is expected to give a very high performance/cost
ratio for supercomputing systems [2].

The internal architecture of the GIPOP processor and its instruction set are shown in
Figure 4. The simulation of the instruction set has been completed. The GIPOP processor
can execute the equivalent pair I{3(Si)4, 3(Ki)4, 4} and C{3(Si)4, 3}.

The equivalent pairs are chosen based on the PACUBE macrocell, GIPOP processor
and the array complexities.


3.2  GIPOP  Processor  Array 

Two levels of pipelining take place in algorithm execution on the GIPOP processor array
partly shown in Figure 4, one at the macrocell level within the GIPOP chip and the other
at the processor level. The processor level pipelining is controlled by the Array Control
Unit (ACU) and the macrocell level pipelining is controlled by the Chip Control Unit (CCU).
The hardware complexity of the switch lattice depends on the processor complexity, the data
communication complexity in an algorithm and the word length. The design of the ACU
is in progress.

Figure 4: GIPOP processor architecture for equivalent pair
[I{3(Si)4, 3(Ki)4, 4}, C{3(S1)4, 3(S2)4, 3(S3)4, 3}]
(panels: GIPOP processor array with switches; GIPOP instruction set)

GIPOP instruction set (instruction format: destination, source, vector cycles), by function:

  Pipelining of data from memory or input port through general purpose registers.
  Loads three four-bit operands for chain multiplication operation either from memory or input port.
  Loads eight four-bit inner-product operands from memory or input port.
  Loads eight four-bit massive array operands.
  Accumulates the output of GIPOP in the primary accumulator.
  Loads four-bit multiplier from memory to scalar multiplier.
  Loads the multiplicand from the accumulator to scalar multiplier.
  Transfers the output of general purpose registers, GIPOP, primary accumulator,
  and the scalar multiplier to the output port. (These loadings can be done in parallel.)


3.3 Mapping Of Numeric Algorithms On GIPOP Processor Array

Mapping of numeric algorithms on the GIPOP processor array involves tackling the
computational (levels 1 & 2) and communicational (level 3) complexities.

Level 1. Decomposing the computational structure of the algorithm in terms of
GIPOP equations, which are further decomposed in terms of the inner
product functions only [1].

Level 2. The architectural capabilities of a GIPOP processor are bounded by
problem size and word length. Suitable algorithms for problem size and word
length partitioning of GIPOP functions and the corresponding software
tools have been developed. The algorithm is based on definitions (3) -
(6) []. Refer to Figure 5.

Level 3. a. Proper loading of input operand frames into the high speed processor
memory (block level memory loading) from the system memory.
b. Programming the processor and array control units.

Numeric algorithm mapping on the GIPOP arrays basically involves mapping of different
inner product blocks (IP Blocks) of variable complexities (see Fig. 2c). Functionally each
of these IP Blocks may correspond to different GIPOP operations. The data flow between
the corresponding group of GIPOP processors (making an IP block) can be syntactically
described, including both the functional and behavioral aspects of the GIPOP processor
[2]. Mapping an algorithm on the arrays amounts to obtaining this syntactical description
of the data flow.

Software tools for levels 1 and 3 are being developed. The novel concept of mapping
numeric algorithms on the GIPOP processor array shown in Figure 2c greatly simplifies
the development of software tools for levels 1 and 3 above.


4 PACUBE  Logic  Level 

4.1  Inter  And  Intra  Cell  Routing. 

An efficient algorithm for inter and intra macrocell routing has been developed taking into
account the shortest path considerations. The software tools are shown in Figure 6 as
flowcharts.

Figure 5: Wordlength & term partitioning (panels: Chain Multiplication; Inner Product)

4.2 Subprogram (Functional Units) Library Generation And Linking.

Several subprograms, sequential and combinatorial, have been mapped on the PACUBE
macrocell. An efficient PACUBE Hardware Description Language (PHDL) has been
developed. The subprogram linking is done using this PHDL and the related software tool
has been developed (Fig. 7).

5 PACUBE  Circuit  Level  Discussions 

5.1  Interactive  Layout  Generation  And  Checking. 

A software tool for the generation of device level layouts in an interactive fashion has
been developed (Fig. 8). It provides four layers, viz. diffusion, polysilicon and two levels
of metal. The package supports both n-well and p-well CMOS processes. Lambda based
design rules have been adopted. Special facilities such as mirroring, translation, rotation,
step & repeat, cut & paste and scaling are available to aid in faster design. The layout of
an entire macrocell has been developed using this tool. Layout geometries are expressed
using PACUBE device level codes. The design rule checker developed is edge based and
performs both inter and intra layer checking.

Figure 6: Inter-cell logic level tool

The different functional units of the macrocell are treated as standard cells and,
depending on the application, the required standard cells can be placed within the macrocell.
This option gives rise to macrocell arrays with different sizes of macrocells.

5.2  Intracell  And  Intercell  Mapping. 

A software translator for the PACUBE logic level code to PACUBE device level code
conversion and a device level to Caltech Intermediate Format (CIF) translator have also
been developed.

Figure 7: Intra-cell logic level tool


5.3  Simulation 

Circuit  simulation  of  different  functional  units  of  the  macrocell  for  1.2  micron  technology 
using  PSPICE  is  nearing  completion.  Logic  level  simulation  of  PACUBE  macrocell  for 
GIPOP  operations  has  been  carried  out  using  the  PACUBE  logic  simulator. 


6 Conclusion 

In this paper novel methodologies have been proposed for designing silicon compilers for
synthesising supercomputing VLSI systems. The important criteria for such a silicon
compiler are integration of its software tools and the architectural considerations.

Figure 8: Sub-program linker


7 Future Work

Figure 9: Pacube macro cell based device level layout editor
A Fast Adaptive Convex Hull Algorithm on
Two-Dimensional Processor Arrays
with a Reconfigurable Bus System¹

S. Olariu, J. Schwing, and J. Zhang
Department of Computer Science
Old Dominion University
Norfolk, VA 23529-0162
U.S.A.

Abstract- A bus system that can change dynamically to suit computational needs
is referred to as reconfigurable. We present a fast adaptive convex hull
algorithm on a 2-dimensional processor array with a reconfigurable bus system
(2-d PARBS, for short). Specifically, we show that computing the convex hull
of a planar set of n points takes O(log n / log m) time on a 2-d PARBS of size
mn × n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in
the plane can be computed in O(1) time on a 2-d PARBS of size n^1.5 × n.


1 Introduction 

Recent advances in VLSI have made it possible to build massively parallel machines
featuring many thousands of cooperating processors. This increase in computational power does
not, however, translate into increased performance of the same order of magnitude. One
of the reasons seems to be that interprocessor communications and simultaneous memory
accesses often act as bottlenecks in parallel machines.

To alleviate the inefficiency of long distance communication among processors, bus
systems have been recently added to a number of parallel machines [2-4,5,6,11]. If such a
bus system can be dynamically changed, under program control, to suit communication
needs among processors, it is referred to as reconfigurable. Examples include the bus
automaton [11], the reconfigurable mesh, and the polymorphic torus [2,3], among others.

The computational model used throughout this work is the reconfigurable mesh [5].
An m × n reconfigurable mesh (also called a PARBS [13]) consists of m × n identical
processors positioned on a rectangular array (refer to Figure 1). The processor at (i,j)
(1 ≤ i ≤ m; 1 ≤ j ≤ n) is identified by P(i,j). Every processor has 4 ports denoted by N,
S, E, and W. There are also implicit north, south, east, and west directions (refer to Figure
1). In each processor, ports can be dynamically connected in pairs to suit computational
needs. In the absence of these local connections, the PARBS is functionally equivalent to
the mesh connected computer.
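
The way local port connections combine into subbuses can be modelled with a small sketch. It is an illustrative model under stated assumptions rather than anything specified by the paper: processors are indexed from 0, neighbouring processors are assumed to be hard-wired port to port (E to W and S to N), and a bus is taken to be a connected set of ports, found here with a union-find structure. The function name and the input format are hypothetical.

```python
def bus_components(rows, cols, local):
    """Return a function mapping a port (i, j, name) to its bus representative.

    local maps a processor (i, j) to a list of port pairs it connects
    internally, for example {(0, 1): [("W", "E")]}.
    """
    parent = {(i, j, p): (i, j, p)
              for i in range(rows) for j in range(cols) for p in "NSEW"}

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:                      # fixed link to the east neighbour
                union((i, j, "E"), (i, j + 1, "W"))
            if i + 1 < rows:                      # fixed link to the south neighbour
                union((i, j, "S"), (i + 1, j, "N"))
            for p, q in local.get((i, j), []):    # programmable local connection
                union((i, j, p), (i, j, q))
    return find


if __name__ == "__main__":
    find = bus_components(1, 3, {(0, 1): [("W", "E")]})
    print(find((0, 0, "E")) == find((0, 2, "W")))
```

With no local connections each link joins only two neighbours, which corresponds to the plain mesh connected computer mentioned above; connecting W and E in the middle processor merges two links into one longer subbus.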

¹ This work was supported by NASA under grant NCC1-99; support by the National Science Foundation under
grant CCR-8909996 is gratefully acknowledged.

Figure 1: A 4 × 5 PARBS

We  assume  that  each  processor  has  a small  number  of  registers  of  size  O(log  n ) bits  and 
that  a processor  can  perform  in  unit  time  standard  arithmetic  and  boolean  operations.  We 
assume  a single  instruction  stream:  in  each  time  unit  the  same  instruction  is  broadcast 
to  all  processors,  which  execute  it  and  wait  for  the  next  instruction.  Each  instruction 
can  consist  of  setting  local  connections  (as  explained  later),  performing  an  arithmetic  or 
boolean  operation,  broadcasting  a value  on  a bus,  or  receiving  a value  from  a specified 
bus.  The  regular  structure  of  the  PARBS  makes  it  suitable  for  VLSI  implementation.  In 
fact,  it  has  been  argued  [5]  that  the  PARBS  can  be  used  as  a universal  chip  capable  of 
simulating  any  equivalent-area  architecture  without  loss  of  time. 

By adjusting the local connections within each processor several subbuses can be
established. We assume that the setting of local connections is destructive in the sense that
setting a new pattern of connections destroys the previous one. At any given time, only
one processor can broadcast a value onto a bus. Processors, if instructed to do so, read
the bus. If no value is being transmitted on the bus, the read operation has no result.
It is assumed [5,6] that communications along buses take O(1) time. This seems to be a
reasonable assumption in the light of recent experiments with the YUPPIE system [4].

A number of problems have been solved in O(1) time on PARBS. Very recently, Wang
et al. [13] have proposed O(1) algorithms for the transitive closure and some related graph
problems; Olariu, Schwing, and Zhang [9] have proposed an adaptive sorting algorithm;
specifically, they show that sorting a sequence of n reals takes O(log n / log m) time on a 2-d
PARBS of size nm × n with 3 ≤ m ≤ n. In particular, their result implies a constant-time
sorting algorithm on an n^1.5 × n 2-d PARBS.

The  convex  hull  of  a set  of  points  in  the  plane  is  defined  as  the  smallest  area  convex 
set  that  contains  the  original  set.  The  problem  of  computing  the  convex  hull  of  points 
in  the  plane  is  central  in  a variety  of  problems  in  pattern  recognition,  computer  graphics, 
statistics,  and  image  processing  [1,7,8,10]. 
To the best of our knowledge, no convex hull algorithm has been reported in the
literature on a 2-d PARBS. The purpose of this paper is to propose a fast adaptive convex
hull algorithm for a set of n points in the plane. We reduce the problem of computing
the convex hull of a set of planar points to the problems of sorting and computing the
prefix maximum of n real numbers. To begin, we show that the problem of computing the
maximum of n real numbers can be solved in O(log n / log m) time on a 2-d PARBS of size m × n
with 2 ≤ m ≤ n. We also use the fast adaptive sorting algorithm of [9]. What results is a
fast adaptive algorithm that computes the convex hull of a set of n points in the plane in
O(log n / log m) time on a 2-d PARBS of size nm × n with 3 ≤ m ≤ n. In particular, for m = n^0.5
we obtain an O(1) time convex hull algorithm on a 2-d PARBS of size n^1.5 × n.


2 The  stepping  stones 

Our convex hull algorithm relies on a number of intermediate results that we present next.
To begin, we consider the problem of computing the prefix maximum of n reals on an n × n
PARBS. Specifically, given n real numbers a1, a2, ..., an with processor P(1,j) storing
aj, the problem is to compute max{ai : 1 ≤ i ≤ j} for all 1 ≤ j ≤ n. Our algorithm involves
establishing a number of subbuses and broadcasting values along them. The details of our
algorithm are spelled out by the following sequence of steps.


Algorithm Prefix-Maximum;

Step 1. every processor $P(i,j)$ ($2 \le i \le n-1$; $1 \le j \le n$) connects its ports N and S;

Step 2. every processor $P(1,j)$ ($1 \le j \le n$) broadcasts $a_j$ southbound along the
        vertical subbus in column $j$;

Step 3. every processor $P(i,j)$ ($2 \le i \le n-1$; $2 \le j < i$) connects its ports W and E;

Step 4. every processor $P(j,j)$ ($2 \le j \le n$) broadcasts $a_j$ westbound along the
        horizontal subbus in row $j$;

Step 5. every processor $P(j,i)$ ($2 \le j \le n-1$; $1 \le i < j$) compares $a_i$ and $a_j$;
        if $a_i > a_j$ then
            $P(j,i)$ disconnects the horizontal subbus;
            marks itself;

Step 6. every marked processor $P(i,j)$ broadcasts a "0" along the horizontal subbus
        eastbound;

Step 7. every processor $P(j,j)$ ($1 \le j \le n$) stores in its own memory a "0" or a "1"
        depending on whether or not it has received a "0" in Step 6;

Step 8. every processor $P(i,j)$ ($2 \le i < j \le n$) connects its ports N and S;

Step 9. every processor $P(j,j)$ ($2 \le j \le n$) broadcasts on the vertical subbus northbound
        the value it has stored in Step 7;

Step 10. every processor $P(1,j)$ ($2 \le j \le n$) that has received a "0" in Step 9 connects
         its ports W and E;

Step 11. every processor $P(1,j)$ ($1 \le j \le n$) that stores a "1" broadcasts $a_j$ eastbound
         along the horizontal subbus in row 1;

Theorem 1. Algorithm Prefix-Maximum correctly computes the prefix maximum of $n$
real numbers in $O(1)$ time on an $n \times n$ PARBS.

Proof. To begin, note that in Step 5, every processor $P(i,j)$ ($2 \le i \le n-1$; $1 \le j < i$)
knows $a_i$ and $a_j$. Further, it is easy to see that at the end of Step 7 processor $P(j,j)$
($1 \le j \le n$) stores a "1" if, and only if, $a_j$ is at least as large as every $a_i$ with $i < j$.

Consequently, every processor $P(1,j)$ in row 1 that at the end of Step 9 stores a "0"
knows that $a_j$ cannot be the prefix maximum of the $a_i$ with $i \le j$. In fact, the prefix
maximum of the first $j$ real numbers $a_1, a_2, \ldots, a_j$ is stored by the first processor
to the left of $P(1,j)$ that stores a "1". The conclusion follows. □
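
The flag-and-broadcast idea behind Steps 1-11 can be illustrated with a short sequential sketch. It is not a bus-level simulation: the function name `prefix_maximum` and the list-based representation are assumptions made here for illustration, and the work done in constant time by the $n \times n$ PARBS is replaced by ordinary loops.

```python
def prefix_maximum(a):
    """Sequentially mimic the flag-and-broadcast logic of Algorithm Prefix-Maximum."""
    n = len(a)
    # Steps 1-7: column j ends up flagged ("1") exactly when no earlier
    # value a[i], i < j, is strictly larger than a[j].
    flag = [all(a[i] <= a[j] for i in range(j)) for j in range(n)]
    # Steps 8-11: flagged columns broadcast eastbound in row 1, so an
    # unflagged column receives the value of the nearest flagged column
    # to its left.
    result = []
    current = None
    for j in range(n):
        if flag[j]:
            current = a[j]
        result.append(current)
    return result
```

The first column is always flagged, so every position receives a value; the output agrees with a running maximum computed from left to right.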

Next, we show how to compute the maximum of $n$ real numbers $a_1, a_2, \ldots, a_n$ on an
$m \times n$ PARBS with $2 \le m \le n$. Again, we assume that the numbers are stored one per
processor such that for all $j$ ($1 \le j \le n$), $P(1,j)$ stores $a_j$. The idea of our algorithm
is to partition the original $m \times n$ PARBS into subPARBS of size $m \times m$. To avoid
tedious but inconsequential housekeeping details we assume that $n$ is a power of $m$.

We partition the $n$ columns into contiguous groups of $m$ columns each and let the $k$-th
subPARBS, $M_k$ ($0 \le k \le n/m - 1$), consist of the columns $km+1, km+2, \ldots, km+m$.
As a preprocessing step, for all $j$ ($2 \le j \le n$) we move the data contained in $P(1,j)$ to
the "diagonal" processor of its $m \times m$ subPARBS, $P((j-1) \bmod m + 1, j)$. The main loop
of this algorithm applies the (prefix) maximum algorithm described above to specified
$m \times m$ subPARBS. This process proceeds iteratively, determining the maxima of groups of
size $m$, $m^2$, $m^3$, and so on. Clearly, after $\log_m n = \log n / \log m$ iterations we
have computed the maximum of the $n$ numbers.

We omit the details of the bus-construction steps, which are similar to those in the previous
algorithm. The reader can easily fill in the details.


Algorithm Maximum;

Step 1. {preprocessing}
        for all $j$ ($1 \le j \le n$) in parallel
            establish a vertical subbus from $P(1,j)$ to $P((j-1) \bmod m + 1, j)$;
            $P(1,j)$ broadcasts $a_j$ on this subbus to $P((j-1) \bmod m + 1, j)$;
            $P((j-1) \bmod m + 1, j)$ marks itself
        endfor;

Step 2. {main loop}
        for $k \leftarrow 1$ to $\log n / \log m$ do
            for all $j$ ($1 \le j \le n/m^k$) in parallel
                all processors connect ports W and E;
                all processors $P(i, (j-1)m^k + 1)$ split the horizontal subbus in row $i$;
                all marked processors broadcast the value they hold
                    along the horizontal subbus westbound;
                all marked processors unmark themselves;
                the corresponding subPARBS computes the maximum of the values
                    in column $(j-1)m^k + 1$;
                let the result be stored in $P((j-1) \bmod m + 1, (j-1)m^k + 1)$;
                all processors $P((j-1) \bmod m + 1, (j-1)m^k + 1)$ mark themselves
            endfor
        endfor;

Theorem 2. Algorithm Maximum correctly computes the maximum of $n$ real numbers
in $O(\log n / \log m)$ time on an $m \times n$ PARBS with $2 \le m \le n$.

Proof. The correctness is implied by the following result: at the end of the $t$-th iteration
($0 \le t \le \log n / \log m$), for all $j$ ($1 \le j \le n/m^t$), processor
$P((j-1) \bmod m + 1, (j-1)m^t + 1)$ contains the maximum in columns $(j-1)m^t + 1$
through $jm^t$.

The proof of the above statement is by induction on $t$. The basis is easy: at the end of
the 0-th iteration the conclusion is guaranteed by the preprocessing step.

Assume the above statement is satisfied at the end of the $t$-th iteration. We only need
show that it also holds at the end of the $(t+1)$-st iteration. For this purpose, it is
instructive to follow the $(t+1)$-st iteration: here, after all processors connect their ports
W and E, thus establishing horizontal subbuses in each row, the processors
$P(i, (j-1)m^{t+1} + 1)$ split the horizontal subbus in row $i$; next, all marked processors
broadcast the value they hold along the horizontal subbus westbound. By the induction
hypothesis, these are processors $P((j-1) \bmod m + 1, (j-1)m^t + 1)$. Therefore, when the
corresponding subPARBS computes the maximum of the values in column $(j-1)m^{t+1} + 1$, the
induction hypothesis guarantees that the resulting value is the maximum in columns
$(j-1)m^{t+1} + 1$ through $jm^{t+1}$, a total of $m^{t+1}$ columns.

To argue for the running time, note that by Theorem 1, the inner for loop runs in $O(1)$
time. The conclusion follows. □
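
The round structure of Algorithm Maximum can be sketched sequentially as follows. The function name `grouped_maximum` is hypothetical, the input is assumed to be non-empty, and each pass of the `while` loop stands in for one iteration of the main loop, in which every group of $m$ surviving candidates is reduced to its maximum at the same time.

```python
def grouped_maximum(a, m):
    """Return (maximum, rounds), reducing groups of m candidates per round."""
    if m < 2:
        raise ValueError("m must be at least 2")
    candidates = list(a)
    rounds = 0
    while len(candidates) > 1:
        # One round: every block of m consecutive candidates is replaced by
        # its maximum (on the PARBS all blocks are handled simultaneously).
        candidates = [max(candidates[k:k + m])
                      for k in range(0, len(candidates), m)]
        rounds += 1
    return candidates[0], rounds
```

When $n$ is a power of $m$, the number of rounds equals $\log n / \log m$, which is the quantity that bounds the running time in Theorem 2.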


3 The  Algorithm 

We are now in a position to present our planar convex hull algorithm. Let
$S = \{p_1, p_2, \ldots, p_n\}$ be a planar set of points; for $1 \le i \le n$, $p_i$ is
represented by its Cartesian coordinates $(x_i, y_i)$. To avoid tedious details we assume,
without loss of generality, that the points in $S$ are in general position, with no three
collinear and no two having the same $x$ or $y$ coordinate. The output of the convex hull
algorithm is a linked list $CH$ that contains all the points on the convex hull, starting with
the one with the largest $x$ coordinate and proceeding counterclockwise. Our algorithm
consists of the following sequence of steps.

Algorithm Convex-Hull;

Step 1. find the four extremal points in $S$, and let them be, without loss of generality,
        $p_1$, $p_2$, $p_3$, and $p_4$. Specifically, $x_1 = \max_{1 \le j \le n}\{x_j\}$,
        $y_2 = \max_{1 \le j \le n}\{y_j\}$, $x_3 = \min_{1 \le j \le n}\{x_j\}$, and
        $y_4 = \min_{1 \le j \le n}\{y_j\}$.

Step 2. compute the sets

        $S_1 = \{p_i \mid x_2 \le x_i \le x_1;\ y_1 \le y_i \le y_2\}$,
        $S_2 = \{p_i \mid x_3 \le x_i \le x_2;\ y_3 \le y_i \le y_2\}$,
        $S_3 = \{p_i \mid x_3 \le x_i \le x_4;\ y_4 \le y_i \le y_3\}$,
        $S_4 = \{p_i \mid x_4 \le x_i \le x_1;\ y_4 \le y_i \le y_1\}$.

        Note: For simplicity, we deal with $S_1$ only, the others being perfectly similar.

Step 3. sort the points in $S_1$ by increasing $y$ coordinate, and let
        $L_1 = (p_1 = q_1, q_2, \ldots, q_t = p_2)$ be the resulting sorted sequence;

Step 4. for all $j$ ($1 \le j < t$) in parallel
            find the subscript $d_j$ ($j < d_j \le t$) such that the angle determined by
            $q_{d_j}$, $q_j$, and the negative direction of the $x$ axis is as large as possible;

Step 5. compute the prefix maximum of the values $d_j$ in $L_1$, and set
        $m(j) \leftarrow \max_{1 \le i \le j-1}\{d_i\}$;

Step 6. $CH_1 \leftarrow L_1$;
        for all $j$ ($2 \le j \le t-1$) in parallel
            remove $q_j$ from $CH_1$ whenever $d_j \le m(j)$;

Before giving the proof of correctness of our algorithm, we note the following simple
observation. The sorted sequence $L_1$ of points obtained at the end of Step 3 can be viewed
as determining a polygonal line (termed a chain in [10]) joining $p_1$ and $p_2$. It is easy
to see that the convex hull $CH$ of the set $S$ of points is exactly the convex hull of the
simple polygon $P$ obtained by concatenating the polygonal lines $L_1$, $L_2$, $L_3$, and
$L_4$, in this order.

The  following  result  argues  for  the  correctness  of  our  algorithm. 

Theorem 3. At the end of Step 6, $CH_1$ contains the portion of the convex hull contained
in $S_1$.

Proof. By the previous observation we only need show that the linked list $CH_1$ obtained
at the end of Step 6 contains the restriction of the convex hull of $P$ between $p_1$ and $p_2$.
This follows from the following claim:

a point $q_j$ ($2 \le j \le t-1$) of $L_1$ belongs to $CH$ if, and only if, $d_j > m(j)$.

First, let $q_j$ ($2 \le j \le t-1$) in $L_1$ belong to the convex hull and let $q_i$ and $q_k$
($i < j < k$) be its immediate neighbors on the convex hull. (We note that since $q_1$ and
$q_t$ trivially belong to the convex hull, the points $q_i$ and $q_k$ are well defined.)
Clearly, $d_i = j$ and so $m(j) = j < d_j = k$, as claimed.

Conversely, if some point $q_j$ in $L_1$ does not belong to the convex hull then let $q_i$ and
$q_k$ ($i < k$) be the closest points on the convex hull, with $q_j$ lying on the chain from
$q_i$ to $q_k$. Since $q_i$ and $q_k$ are neighbors on the convex hull, we have $d_i = k$;
furthermore, $d_j \le k = m(j)$, and the conclusion follows. □
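
Steps 3-6 and the claim used in the proof can be tried out on a small example with the following sequential sketch. The function name `upper_right_chain` is hypothetical, indices are 0-based rather than 1-based, and the input is assumed to be the set $S_1$ in general position, including both $p_1$ and $p_2$.

```python
import math

def upper_right_chain(points):
    """Steps 3-6 for S_1; `points` must include p_1 and p_2 (at least two points)."""
    q = sorted(points, key=lambda p: p[1])            # Step 3: sort by increasing y
    t = len(q)

    def angle(j, k):
        # Angle at q_j between the ray towards q_k and the negative x direction.
        return math.pi - math.atan2(q[k][1] - q[j][1], q[k][0] - q[j][0])

    # Step 4: d[j] is the later index whose angle is largest.
    d = [max(range(j + 1, t), key=lambda k: angle(j, k)) for j in range(t - 1)]

    # Steps 5-6: m_j is the maximum of d over all earlier indices; q_j survives
    # only if d[j] > m_j. The two endpoints are always kept.
    chain = [q[0]]
    m_j = d[0]
    for j in range(1, t - 1):
        if d[j] > m_j:
            chain.append(q[j])
        m_j = max(m_j, d[j])
    chain.append(q[-1])
    return chain
```

For the six points (10,0), (3,2), (5,5), (8,6), (6,9), (0,10), the points (3,2) and (5,5) lie inside the hull and are removed, leaving the chain from $p_1$ to $p_2$ in counterclockwise order.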

Next, we show how Steps 1-6 above can be efficiently implemented on a 2-d PARBS. More
precisely, we assume a 2-d PARBS of size $nm \times n$ with $3 \le m \le n$. Some of
Steps 1-6 of our algorithm need the whole PARBS while others can run on a subPARBS, as
specified; the data movement necessary to conform to the input requirements of a specific
step is ignored here; the reader can easily work out the details.

Step 1 can be implemented to run in $O(1)$ time on an $n \times n$ subPARBS since we only
need compute $\max_{1 \le j \le n}\{z_j\}$ and $\min_{1 \le j \le n}\{z_j\}$ with $z = x$ and $z = y$.

Step 2 is demonstrated for $S_1$ only; computing $S_i$ with $i = 2, 3, 4$ is similar. All that
is needed is to establish a subbus running through the whole of row 1. The processors
storing $p_1$ and $p_2$ broadcast, in two computational steps, their Cartesian coordinates to
all processors in row 1; every processor that stores a point in $S_1$ marks itself. Thus Step 2
runs in $O(1)$ time.
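
As a sequential illustration of Steps 1 and 2, the sketch below finds the four extremal points and builds the sets $S_1$ through $S_4$ from the coordinate ranges given in Algorithm Convex-Hull. The function name `quadrant_sets` and the helper `box` are hypothetical; points are assumed to be `(x, y)` tuples in general position.

```python
def quadrant_sets(points):
    """Return the extremal points (p1, p2, p3, p4) and the sets (S1, S2, S3, S4)."""
    p1 = max(points, key=lambda p: p[0])   # largest x
    p2 = max(points, key=lambda p: p[1])   # largest y
    p3 = min(points, key=lambda p: p[0])   # smallest x
    p4 = min(points, key=lambda p: p[1])   # smallest y

    def box(x_lo, x_hi, y_lo, y_hi):
        # Every point whose coordinates fall inside the closed rectangle;
        # on the PARBS each processor in row 1 tests its own point.
        return [p for p in points
                if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]

    s1 = box(p2[0], p1[0], p1[1], p2[1])
    s2 = box(p3[0], p2[0], p3[1], p2[1])
    s3 = box(p3[0], p4[0], p4[1], p3[1])
    s4 = box(p4[0], p1[0], p4[1], p1[1])
    return (p1, p2, p3, p4), (s1, s2, s3, s4)
```

Each extremal point belongs to two adjacent sets, which is what lets the four chains be concatenated into a closed polygon.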

Step 3 can be implemented as follows. First, all unmarked processors change the $y$
coordinate of the point that they store to $+\infty$. Now the sorting algorithm in [9] is
invoked: this runs in $O(\log n / \log m)$ time and uses the whole PARBS. Note that at the end
of Step 3, processors $P(1,1), P(1,2), \ldots, P(1,t)$ contain $L_1$ in sorted order.

Step 4 can be implemented to run in $O(\log n / \log m)$ time on an $nm \times n$ subPARBS as
follows. Recall from Step 3 that, initially, for all $1 \le j \le t$, $P(1,j)$ stores $q_j$.
This subPARBS is further subdivided into subPARBS of size $m \times n$ as follows. The first
$m \times n$ subPARBS involves the first $m$ rows, the second the next $m$ rows, and so on.
We establish vertical subbuses in each column and let $P(1,j)$ broadcast the Cartesian
coordinates of $q_j$ along the subbus in column $j$ ($1 \le j \le t$). Next, establish horizontal
subbuses running from $P(m(j-1)+1, j)$ to $P(m(j-1)+1, t)$ ($1 \le j \le t$). Note that these
are precisely the first rows of our $m \times n$ subPARBS. For all $j$, $P(m(j-1)+1, j)$
broadcasts the Cartesian coordinates of $q_j$ eastbound on the horizontal subbus in row
$m(j-1)+1$. Every processor $P(m(j-1)+1, k)$ with $j < k \le t$ computes the angle specified
in Step 4. Actually, computing the angle itself is not necessary: the tangent of the angle
can be readily computed using two subtractions and a division. Now the maximum of all values
in the first rows of these subPARBS can be computed in $O(\log n / \log m)$ time using
Algorithm Maximum developed in Section 2. It is easy to arrange for the maximum in row
$m(j-1)+1$ to be sent back to $P(1,j)$. This clearly takes $O(1)$ time since only the
appropriate subbuses have to be established and the information broadcast along them.
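
The remark that the angle need not be computed explicitly can be checked with a small sketch. As an assumption made here, the code uses the ratio $(x_k - x_j)/(y_k - y_j)$, which is the negated cotangent of the angle rather than its tangent; it also costs two subtractions and one division, and it increases with the angle over the whole range $(0, \pi)$ because $y_k > y_j$ for every $k > j$ in the sorted sequence. The function names are hypothetical.

```python
import math

def angle_with_negative_x(qj, qk):
    """Angle at qj between the ray towards qk and the negative x direction."""
    return math.pi - math.atan2(qk[1] - qj[1], qk[0] - qj[0])

def ratio_key(qj, qk):
    """Two subtractions and one division; requires qk[1] > qj[1]."""
    return (qk[0] - qj[0]) / (qk[1] - qj[1])

def best_successor(q, j, key):
    """Index k > j that maximizes key(q[j], q[k]); q is sorted by increasing y."""
    return max(range(j + 1, len(q)), key=lambda k: key(q[j], q[k]))
```

Both keys select the same successor for every point of a sorted sequence, so the trigonometric call can be dropped without changing the value of $d_j$.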

Step 5 can be implemented to run on an $n \times n$ subPARBS by using Algorithm
Prefix-Maximum discussed in Section 2.

Step 6 involves marking every $P(1,j)$ that contains a point of the convex hull. After
this is done, a horizontal subbus is established in row 1. Every marked processor splits
this bus and broadcasts its identity westbound on its own subbus. This, in fact, creates
the list $CH_1$ as desired. Clearly, the running time of this step is $O(1)$.

To  summarize  our  discussion  we  state  the  following  result. 

Theorem 4. The convex hull of a planar set of $n$ points can be computed on a PARBS
of size $nm \times n$ with $3 \le m \le n$ in $O(\log n / \log m)$ time. □

In particular, if $m = n^{0.5}$ then we have the following result.

Corollary 4.1. The convex hull of a planar set of $n$ points can be computed in $O(1)$ time
on a PARBS of size $n^{1.5} \times n$. □
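
The step from Theorem 4 to the corollary is a one-line calculation, written out here for a general exponent $\varepsilon$ (a symbol introduced only for this illustration) and then for $\varepsilon = 0.5$; the constraint $m \ge 3$ then requires $n \ge 9$.

```latex
\[
  m = n^{\varepsilon},\ 0 < \varepsilon \le 1
  \quad\Longrightarrow\quad
  \frac{\log n}{\log m} = \frac{\log n}{\varepsilon \log n} = \frac{1}{\varepsilon},
\]
\[
  \varepsilon = 0.5:\qquad
  \frac{\log n}{\log m} = 2 = O(1),
  \qquad nm \times n = n^{1.5} \times n .
\]
```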


4 Conclusion 

A bus system that can be dynamically altered to suit communicational needs among
cooperating processors is referred to as reconfigurable. In this paper we presented a fast
adaptive algorithm to solve the planar convex hull problem.

Specifically, we showed that computing the convex hull of a set of $n$ points in the plane
takes $O(\log n / \log m)$ time on a 2-d PARBS of size $nm \times n$ with $3 \le m \le n$. In
particular, our result implies that the same problem can be solved in $O(1)$ time on a 2-d
PARBS of size $n^{1.5} \times n$.


1 Introduction

Hardware  description  languages  such  as  VHDL  have  evolved  to  aid  in  the  design  of  systems 
with  large  numbers  of  elements  and  a wide  range  of  electronic  and  logical  abstractions.  For 
high  performance  circuits,  behavioral  models  may  not  be  able  to  efficiently  include  enough 
detail  to  give  designers  confidence  in  a simulation’s  accuracy.  One  option  is  to  provide 
a link  between  the  VHDL  environment  and  a transistor  level  simulation  environment. 
The  coupling  of  the  Vantage  Analysis  Systems  VHDL  simulator  and  the  NOVA  simulator 
provides  the  combination  of  VHDL  modeling  and  transistor  modeling. 

2 Vantage  VHDL  Simulator 

The Vantage Analysis Systems VHDL simulation environment is a full implementation of
the IEEE 1076 VHDL Standard. The Vantage system is written entirely in “C”. Hierarchical
designs from Mentor Graphics’ NetEd, EDIF or other schematics can be imported
into Vantage. The Vantage system compiles VHDL designs and simulates their behavior
either interactively or in batch mode. Incremental symbol or schematic changes can
be made in the Vantage environment, from which structural VHDL can be generated
automatically. The created or edited schematics can be exported back into the original
schematic environment. Connectivity and structural checks are made by the schematic
viewer.

Results  may  be  viewed  as  waveforms  or  as  entries  in  a table,  and  can  be  viewed  as  the 
simulator  is  running  or  after  the  simulation  run  has  completed.  Circuit  node  values  can 
also  be  displayed  on  the  associated  node  in  the  circuit  schematic. 

The Vantage simulation control gives the user source code level breakpoints and triggering
capability based on a very wide range of conditions specified by the user. Convenient
viewing and manipulation of the values of signals, variables and constants are provided.
Breakpoints can be based on boolean expressions, a change in a signal, source code lines, or
design units (by instance or globally).

In the Vantage system, the VHDL source code is parsed by a C code generator. The host
“C” compiler then prepares the executable file or files. The Vantage Intermediate
Format is used for the generation of the “C” code. All design units lower in the hierarchy
must have older time stamps than the designs that reference them. The Vantage system
automatically recompiles all “out-of-date” design units referenced during a recompile.

Several conveniences are provided by the Vantage system, including automatic mapping
of signal names that do not conform to the VHDL signal name convention to an internal,
VHDL compatible form. This permits familiar names of existing systems to be used when
interfacing with the Vantage simulator. The Vantage Control Language facilitates the
generation of test vectors.

Extensive libraries of vendor supplied VHDL models, parts and packages, from SSI to
VLSI, are available. Any VHDL in a Vantage library can be exported to an ASCII file.
Vantage also supplies a concurrent compiler that spreads the compilation task of a design
across a network to improve compilation speed.


3 The  NOVA  Simulator 

NOVA is a logic simulator that was recently developed at the University of Idaho NASA
Space Engineering Research Center for VLSI Systems Design. NOVA is a second generation
design, targeted for designs of up to a few million primitives (transistors and logic gates).
NOVA has been used to simulate integrated circuits designed for the NASA Space Station
and Explorer missions and other NASA projects, and for Hewlett Packard disk and tape
drives. NOVA has been ported to the HP 9000 Series 300, 400 and 700, the HP/Apollo
DN10000, the Cray X-MP and NeXT systems. Behavioral models can be used in NOVA
to assist in the architectural definition of major functional blocks before circuit details
are completely known, and to improve simulation performance at most levels of system
modeling.

Structural  description  is  accomplished  using  the  BOLT  or  HP  Block  Description  (BDL) 
languages. NOVA uses a hierarchical design methodology, allowing designs to be conveniently
partitioned. Efficient management of design complexity is made possible by the
block  oriented  circuit  description.  S CLP  schematics,  based  on  Hewlett  Packard  design 
tools,  provide  schematic  documentation,  from  which  the  BOLT  design  description  can  be 
extracted.  

NOVA supports synchronous and asynchronous modeling of hierarchical designs using
logic primitives and intrinsic devices. Most types of transistors and logic gates are
represented in the existing library, including bidirectional CMOS devices and bidirectional
transmission gates. If the designer requires new primitives to accurately model a special
circuit, NOVA provides a means of incorporating the user defined model.

NOVA provides for full timing analysis of combinational and sequential circuits, specified
by rise and fall delays, using a timing wheel based simulation engine.
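
The text does not describe how NOVA's engine is built, so the following is only a generic sketch of the timing wheel technique, not NOVA's implementation. The class name `TimingWheel`, its methods, and the use of integer time units are all assumptions made for illustration: events are filed in a circular array of slots indexed by their firing time modulo the wheel size, so scheduling a rise or fall delay and advancing the clock are both constant-time operations.

```python
class TimingWheel:
    """Circular array of event lists, one slot per simulation time unit."""

    def __init__(self, slots):
        self.slots = [[] for _ in range(slots)]
        self.now = 0

    def schedule(self, delay, event):
        """File an event to fire `delay` time units from now."""
        if not 1 <= delay <= len(self.slots):
            raise ValueError("delay must be between 1 and the wheel size")
        self.slots[(self.now + delay) % len(self.slots)].append(event)

    def advance(self):
        """Move the clock forward one unit and return the events that fire."""
        self.now += 1
        index = self.now % len(self.slots)
        fired, self.slots[index] = self.slots[index], []
        return fired
```

A gate with a rise delay of 3 units and a fall delay of 5 units would schedule its output changes with `schedule(3, ...)` and `schedule(5, ...)`; delays longer than the wheel need an overflow list, which this sketch omits.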

Behavioral modeling, using a “C” based functional model capability, allows the designer
to generate high level descriptions of a block of circuitry. Productivity is improved
by allowing the designer to simulate the function of a block before the detailed circuit
implementation is available. Very good behavioral modeling performance is achieved by
compiling the functional models with the simulation engine. Transistor or logic gate circuit
models can be mixed with, or replace, behavioral models, with full timing and delay
modeling capability.

A configurable X11 graphical user interface assists the designer in viewing and interpreting
simulation results. Signals and busses can be viewed as waveforms. Trigger conditions
can be defined to find specific signal relationships in the simulation output, allowing
in-depth analysis of complex events.

NOVA also provides numerous analysis features, such as node coverage, simulation debugging,
output formatting, node forcing, simulation state logging and saving, and others.

Software  tools  in  NOVA  greatly  simplify  test  vector  development  for  design  verification. 
A test  vector  programming  language  makes  it  possible  for  the  designer  to  develop  compact 
descriptions  of  complex  simulation  sequences. 

The  overall  capability  of  the  NOVA  simulator  closely  matches  that  of  many  commercial 
simulators.  A major  issue  is  the  fact  that  NOVA’s  behavioral  modeling  capability  is  not 
close  to  any  industry  standards,  like  VHDL,  which  makes  it  difficult  for  NOVA  users  to 
leverage  existing  model  libraries  or  to  import  existing  designs.  On  the  other  hand,  the 
transistor  level  modeling  in  NOVA  is  highly  evolved  and  well  known  by  the  NOVA  user 
community.  Using  VHDL  as  a modeling  language  will  likely  open  many  opportunities 
for  an  organization  like  the  NASA  SERC  for  VLSI  Systems  Design,  compared  to  using  a 
proprietary  modeling  language. 

4 Multiple  Value  Logic  Systems 

Simulations  performed  at  the  logic  level  of  abstraction  describe  a digital  circuit  in  terms  of 
primitive  logic  functions  such  as  NAND,  NOR,  etc.,  and  allow  for  the  nets  interconnecting 
the  logic  functions  to  carry  states  of  zero,  one,  unknown,  and  high  impedance.  In  the 
case  of  NOVA,  strengths  of  active,  resistive  and  floating  accompany  the  logic  states,  to 
provide a total of twenty-two logic values. These twenty-two logic values are built into the
structure of the simulator; they can be changed or expanded, though not necessarily easily.

The VHDL standard provides a Multi-Value Logic structure that allows the individual
user of VHDL to tailor the resolution of logic values to satisfy the needs of a general design
methodology or the specific preferences of individuals. This flexibility in describing logic
values in a digital system can have a considerable influence on the transportability of a
VHDL model from one design group to another. Having too few logic values can cause
erroneous results in hardware systems that have bidirectional data busses, open-collector
or high-impedance conditions. More logic values are necessary to model open-collector
devices with pull-up resistors and situations that can occur when initializing a digital
system.

For the coupling of NOVA and the Vantage VHDL simulator, the model compatibility
issue is resolved by using the same types of logic states and strengths for both simulation
systems. The Multi-Value logic system of the Vantage system is set to represent the same
values as NOVA, with the resolution functions providing the same logic value when circuit
outputs are connected together.

5 Transistor Level Performance Issues

One  of  the  primary  motivations  for  this  work  is  the  acceleration  of  the  performance  of 
transistor,  or  switch  level,  simulations  in  the  VHDL  environment.  During  the  original 
design  of  the  VHDL  language,  transistor  level  simulation  was  not  included  as  a primary 
requirement.  However,  algorithms  have  been  developed  in  VHDL  that  can  simulate  the 
properties  of  bidirectional  transmission  gates,  without  extensions  to  the  VHDL  language. 

Given that VHDL can model bidirectional pass transistor networks, a second issue is
the amount of memory required to represent a primitive in a VHDL simulator compared
to a more “hardwired” simulator that has semantics built into its runtime kernel, such
as NOVA or Verilog. NOVA uses about 35 bytes per primitive (transistor) in the internal
representation. The amount of memory required for the average primitive in VHDL is not
as easily determined. Based on overall file size, it appears that as much as 1000 bytes of
data are associated with each primitive in the Vantage simulation system, a VHDL system
that is known for relatively good performance. It is common for VHDL system models to
require virtual memory, which automatically incurs at least a 10X performance penalty
relative to a simulation model that runs entirely in RAM. A NOVA model of 500,000
transistors, entirely composed of transistor primitives without any use of behavioral models,
will fit in a workstation’s 32 megabyte random access memory.
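
The memory comparison can be made concrete with a few lines of arithmetic. The figures of 35 and 1000 bytes per primitive are the ones quoted above; treating a megabyte as $2^{20}$ bytes and the helper name `model_size_mib` are assumptions made for this sketch.

```python
MIB = 2 ** 20  # bytes per megabyte, assuming binary megabytes

def model_size_mib(primitives, bytes_per_primitive):
    """Memory footprint of a flat model, in megabytes."""
    return primitives * bytes_per_primitive / MIB

nova_mib = model_size_mib(500_000, 35)      # NOVA: about 35 bytes per transistor
vhdl_mib = model_size_mib(500_000, 1000)    # upper estimate quoted for Vantage
fits_in_ram = nova_mib <= 32                # 32 megabyte workstation
```

Under these figures the NOVA model needs roughly 17 megabytes and fits in RAM, while the same model at 1000 bytes per primitive needs several hundred megabytes and would have to rely on virtual memory.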

Another difference between the Vantage VHDL modeling system and NOVA is the
use of resolution functions. A resolution function, applied to a node in a circuit, is used
to return the value of a signal when the signal is driven by multiple drivers during a
simulation. All VHDL signals with multiple drivers must have a resolution function tied to
that signal. With VHDL, the designer has the capability of defining any type of resolution
function desired, such as wired-OR, wired-AND or average signal value. NOVA has the
equivalent of a resolution function, but it is coded in optimized “C”, tightly linked with
the rest of the core of the simulator, and is fixed in definition. A VHDL resolution function
is written in VHDL as a package, a representation that will be translated into C code
but not optimized for fast execution.
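
To show what a resolution function does, here is a minimal sketch of wired-AND and wired-OR resolution. It assumes a four-value logic ("0", "1", unknown "X", high impedance "Z") chosen only for illustration; it does not reproduce NOVA's twenty-two values or any particular VHDL package, and the function names are hypothetical.

```python
def resolve_wired_and(drivers):
    """Resolve a net with several drivers, wired-AND style."""
    active = [v for v in drivers if v != "Z"]   # high-impedance drivers do not contribute
    if not active:
        return "Z"                              # nothing is driving the net
    if "0" in active:
        return "0"                              # any driver pulling low wins
    if "X" in active:
        return "X"                              # an unknown driver could be low
    return "1"

def resolve_wired_or(drivers):
    """Resolve a net with several drivers, wired-OR style."""
    active = [v for v in drivers if v != "Z"]
    if not active:
        return "Z"
    if "1" in active:
        return "1"                              # any driver pulling high wins
    if "X" in active:
        return "X"
    return "0"
```

The same set of drivers can resolve differently under the two rules, for example `["1", "0", "Z"]` gives "0" for wired-AND and "1" for wired-OR, which is why both simulators must use matching resolution rules when they are coupled.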

6 “C” Behavioral Model Interface to the Vantage Simulator

A complete simulation system should be able to efficiently and quickly incorporate
algorithms not represented in the native language of the simulator. To make fast
development of designs possible, algorithms that already exist in program form should be
usable in system simulations without requiring a reimplementation. If a design under
development uses, for example, output from a digital filtering algorithm that is going to be
part of another integrated circuit, it may be of considerable advantage to use a high level
programming language version of that algorithm. Developing a new implementation of a digital
filtering algorithm is not only a duplication of effort; new sources of error and changes in
performance may also result.

Both NOVA and the Vantage Simulator allow “C” based behavioral models to be
compiled into each simulation environment. In NOVA, a BOLT description is written
for the behavioral model, describing the input and output connections to the rest of the
simulation model. The NOVA “C” behavioral model is compiled in “C” and the resulting
object data is compiled with NOVA into a form containing the regular primitive based
simulation environment and the functional model. In the Vantage VHDL environment, the
process is similar, in that the user must provide an entity (the structural or input/output
description) written and compiled in VHDL prior to compiling the combined VHDL/C
architecture. The architecture (the behavior of the module) is written and compiled in
“C”. Both systems provide the necessary parameters required for passing state, strength
and timing information between the behavioral model and the main simulation model.
Again, using an industry standard language, such as VHDL, as the modeling interface
should provide more flexibility and opportunity in the future.

7 “C” Based Simulator Interface to the Vantage Simulator

Using the Vantage Simulator and NOVA is one way of meeting the dual goals of using
an industry standard behavioral modeling language and achieving decent transistor level
simulation performance. It is presently estimated that the transistor level simulation
performance of NOVA will exceed that of the Vantage Simulator by 20 to 50 times. The
software to accomplish the link between NOVA and the Vantage Simulator is expected to
be available from Vantage in the near future.


8 Future  Directions 

Research is in progress to identify simpler behavioral modeling methodologies that are
quicker and easier to use. The objective is to reduce design time by having only one
complete representation of a design, first as a top level behavioral model which is then
broken into subsystems of the design as the function of each block is identified. At any
time, either the behavioral or transistor level representation of a block can be used. For
performance reasons, the behavioral models can be used. For detailed circuit timing and
performance analysis, the transistor level representations of the blocks being designed can
be used while the rest are left at the behavioral level. VHDL is satisfactory as the modeling
language for this effort, since description standards will be a large part of the solution to
an easier-to-use simulation environment. In a complementary effort, software tools are
being developed by the University of Idaho Computer Science Department to compare the
functionality of behavioral models and transistor level models.


